Abstract

This research develops a general methodology for designing neural network classifiers
for real-world environmental problems. This methodology is demonstrated through
the design of a multi-layer perceptron to classify stainless steel and actinide samples. This
research provides techniques for selecting architecture and training parameters, choosing
the number of training epochs, reducing the feature sets, and evaluating classifier
performance. For the stainless steel data, the feature set is reduced from 196 features to 54
features, and the average training set error and test set error are 0.6% and 5.5%, respectively.
The best results attained on the actinide data set are 24% training set error and 26% test
set error. The actinide feature set is reduced from 18 features to 9 features. The products
of this research are a concise methodology for developing neural network classifiers and
a multi-layer perceptron classifier for each of the data sets.

Neural Network Classification of Environmental Samples


I.  Introduction 

1.1  Background 

Like  many  Air  Force  organizations,  the  Air  Force  Technical  Applications  Center 
(AFTAC) has a mission, ensuring Nuclear Test Ban Treaty compliance, which involves
monitoring  of  the  environment.  This  environmental  monitoring  often  requires  identification 
of  unknown  chemical  compounds  and,  for  AFTAC,  often  requires  determining  the  origin 
of  environmental  samples. 

The  traditional  method  used  to  identify  environmental  samples  is  to  determine  the 
individual components of the sample, thereby elucidating the identity of the sample
itself. The field of analytical chemistry has conceived numerous instrumental methods to
identify the constituents of a sample, including mass spectrometry, Raman spectroscopy,
fluorescence  spectroscopy,  and  gas  chromatography  [16].  Each  of  these  methods  produces 
spectra  from  which  the  relative  quantities  of  the  sample’s  constituents  may  be  determined. 
Consequently,  these  quantities  may  be  used  to  classify  the  sample  by  comparison  to  those 
of  known  samples.  However,  the  task  of  classifying  a  sample  given  these  quantities  is  often 
complicated  by  similarity  of  samples  and  the  large  number  of  known  chemical  compounds. 
Consequently,  automated  methods  of  classifying  environmental  samples  are  necessary. 

Neural networks, sometimes called artificial neural networks, have been shown
capable of classifying complex patterns such as those mentioned above [4]. Artificial neural
networks are physiologically motivated computer algorithms which attempt to mimic the
function of the large interconnected network of neurons in the human brain, which has
extraordinary pattern recognition capabilities [23]. These artificial neural networks learn
to map a set of input features, such as elemental composition, onto a set of outputs, such as a
binary node whose output (1 or 0) represents steel or not steel. For this reason, neural
networks may be used to classify the given environmental data.

Numerous efforts have illustrated the use of neural networks for classifying chemical
spectra. A thorough review of these efforts is provided by Burns and Whitesides [4].
However, the efforts focus primarily on the chemical analyses and the neural network results,
rather  than  the  methodology  for  designing  the  neural  network  classifiers.  In  addition, 
classifiers  are  unique  to  the  data  being  classified  and  must  be  designed  specifically  for  the 
given  data.  Furthermore,  all  efforts  thus  far  have  used  the  analytical  spectra  as  input 
features  rather  than  the  relative  quantities  which  may  be  derived  from  the  spectra,  and 
no  effort  has  been  made  to  correlate  samples  to  the  location  from  which  they  were  taken. 
As  a  result,  this  thesis  focuses  on  classification  by  name  and  by  location  using  elemental 
percentages  (by  weight)  and  radioactive  particle  percentages  (by  atom)  as  input  features. 

1.2  Problem  Statement 

A methodology for the design of neural network classifiers for environmental
applications is developed and is demonstrated through the classification of stainless steel and
actinide  samples  given  the  relative  quantities  of  the  constituent  elements  and  particles  of 
each  sample. 

1.3  Scope

Elemental  percentage  by  weight  and  radioactive  particle  percentage  by  atom  for  the 
stainless  steel  samples  and  radioactive  particle  percentages  alone  for  the  actinide  samples 
have  been  provided  by  the  Air  Force  Technical  Applications  Center.  A  neural  network  is 
designed  to  classify  this  data  and  the  salient  features  (features  most  useful  in  classification) 
are  determined.  Once  designed,  the  performance  of  the  classifiers  is  evaluated  by  estimating 
the  Bayes  error  bounds.  While  the  results  obtained  herein  are  unique  to  this  particular 
data  set,  the  methodology  is  general  enough  to  be  applied  to  other  environmental  data 
sets. 

1.4  Research  Objectives 

The  research  objectives  for  this  thesis  are  as  follows: 




1.  Provide  a  methodology  for  developing  neural  network  classifiers  for  environmental 
applications. 

2.  Demonstrate  the  ability  of  neural  networks  to  classify  real-world  environmental  data. 

3.  Design  a  neural  network  to  classify  the  given  data  sets. 

4.  Determine  the  salient  features  for  the  given  data  sets. 

5.  Evaluate  the  performance  of  each  classifier. 

1.5  Approach 

The  approach  taken  in  this  effort  consists  of  several  distinct  steps.  Initially,  the  data 
is  preprocessed.  This  preprocessing  consists  of  several  steps  including  normalizing  the  data. 
The  second  step  is  to  develop  a  multi-layer  perceptron  for  classifying  the  stainless  steel  and 
actinide  data  sets.  This  includes  selection  of  architecture  and  parameter  values  as  well  as 
training  and  testing  the  classifier  to  achieve  an  acceptable  level  of  performance.  Following 
testing  of  the  multi-layer  perceptron,  a  technique  known  as  forward  sequential  selection 
is  used  to  rank  the  input  features  in  the  order  of  decreasing  importance  to  classification. 
Then,  another  classifier  is  trained  and  tested  using  subsets  of  the  input  features  in  order 
to  enhance  the  performance  of  the  classifier.  Finally,  the  performance  is  evaluated  by 
comparing  the  classifier  error  to  Bayes  error  bounds. 

1.6  Thesis  Overview 

The  remainder  of  this  thesis  is  organized  as  follows:  Chapter  II  provides  an  overview 
of  the  neural  networks  employed  in  this  research.  Chapter  III  describes  the  implementation 
and evaluation of these networks and presents the results of this effort. Finally, a summary
of these results and conclusions are presented in Chapter IV.

II.  Theory


2.1  Introduction

The  purpose  of  this  chapter  is  to  provide  an  introduction  to  pattern  recognition  and 
neural networks, and to overview the theory necessary to understand the methods presented
in  Chapter  III.  Only  concepts  relevant  to  this  research  are  covered  in  this  chapter.  The 
topics  covered  in  the  following  sections  are: 

•  Introduction  to  Pattern  Recognition 

•  The  Single  Perceptron 

•  The  Multi-Layer  Perceptron 

•  Feature  Selection 

•  Bayes  Error  Bounding 

2.2  Introduction  to  Pattern  Recognition 

According  to  Duda  and  Hart,  pattern  recognition  is  the  assignment  of  an  object  to 
any  of  several  categories  based  on  measurements  of  the  object’s  features.  More  specifically, 
pattern recognition is selecting the actual state of nature $\omega$ of an object from a set of
possible states $\omega_i$ based on some measurement $x$ [9].

The  process  of  selecting  the  class  of  an  object  based  on  some  analog  measurement  of 
that  object  amounts  to  drawing  a  decision  boundary  in  the  feature  space  which  separates 
the members of competing classes. For example, suppose that indoor air samples are to
be classified as hazardous or non-hazardous based on average particle size $x_1$. If a number
of samples for which the class memberships are known are plotted in the feature space,
the optimal decision boundary is the line which completely separates the two classes, as
shown in Figure 2.1a. If, however, the samples are to be classified based on both average
particle size $x_1$ and radon concentration $x_2$, the decision boundary must be drawn in a
two-dimensional feature space, as shown in Figure 2.1b.

Figure 2.1  Two-class classification problem: (a) one-dimensional case (b) two-dimensional case.


Once  the  decision  boundary  has  been  set,  any  subsequent  samples  of  unknown  class 
membership  are  then  classified  according  to  their  position  in  the  feature  space  relative  to 
the  decision  boundary. 

This is a grossly oversimplified representation of real-world pattern recognition
problems for several reasons. Classification of objects will usually be based on multiple features.
In general, x will be a vector containing measurements of multiple features. In this thesis,
for example, each sample is represented by a vector containing the percentages of each
periodic  element  and  the  percentages  of  certain  radioisotopes  which  compose  the  sample. 
Furthermore,  it  may  be  impossible  to  draw  a  decision  boundary  which  completely  separates 
the  class  data.  In  reality,  the  boundary  is  drawn  such  that  the  probability  of  classification 
error  is  minimized.  This  concept  is  discussed  further  in  section  2.6.  The  boundary  may 
also  be  more  complex  (consisting  of  multiple  lines  or  non-linear  boundaries)  in  many  cases 
such as non-linearly separable problems and multi-class problems. The classic example of
non-linearly separable data is the XOR problem shown in Figure 2.2a. Clearly, the two
classes of data cannot be separated by a single line; rather, a complex decision
boundary is required, as shown. Similarly, Figure 2.2b shows a three-class problem which also requires
a complex boundary. As the dimensionality (the number of features) of a classification
problem or the number of classes grows, the decision boundary becomes even more complex,
consisting of surfaces, rather than lines, in the feature space. In this case, sophisticated
algorithms must be used to derive decision boundaries.




Figure  2.2  Complex  decision  boundaries:  (a)  XOR  data  (b)  three-class  data. 


Automated  pattern  classification  may  be  accomplished  using  machines  or  algorithms 
known  as  pattern  classifiers,  or  simply  classifiers.  Further,  automated  pattern  recognition 
consists  of  two  distinct  stages:  determining  the  decision  boundaries  based  on  existing 
data  (training  the  classifier),  and  classifying  new  data  (testing  the  classifier).  Numerous 
statistical classifiers exist which utilize the probability distributions associated with the
state of nature, $P(\omega_i)$, and the measurement $x$, $p(x)$, and the conditional probability
distributions $P(\omega_i \mid x)$ to determine the decision boundaries. An overview of Bayes Decision
Rule,  which  is  the  fundamental  basis  of  statistical  classifiers,  is  provided  in  section  2.6.  For 
a  thorough  treatment  of  statistical  classification  methods,  the  reader  is  referred  to  Duda 
and  Hart  [9]. 

Artificial  neural  networks  may  also  be  employed  for  pattern  classification  problems. 
These  classifiers  were  precipitated  by  the  realization  that  animals,  especially  humans,  have 
the ability to rapidly classify complex patterns. This led to much research in the 1950s
and 1960s, out of which came mathematical models which were intended to elucidate the
function of the human brain (an overview of these research efforts is given by Rogers et
al. [23] and by Lippmann [17]). The value of these models as problem-solving machines
soon  became  apparent  and  the  study  of  pattern  recognition  branched  into  two  subfields, 
the  study  of  human  and  animal  pattern  recognition  and  the  development  of  mathematical 
analogs  of  biological  systems  capable  of  pattern  recognition  [31].  The  latter  has  become  a 
large  field  concerned  with  the  development  of  artificial  neural  network  architectures  and 
learning/training  algorithms. 

Although  many  different  architectures  and  algorithms  have  been  developed  [15,  6, 12], 
the  most  popular  is  the  multi-layer  perceptron  and  its  various  training  algorithms.  This 
popularity is due to its ease of implementation and its ability to represent any decision
boundary [8]. Furthermore, the multi-layer perceptron is a Bayes optimal classifier (see
Section 2.6), which means it cannot be outperformed, on average, by other types of
classifiers on the same data set [26]. For these reasons, the multi-layer perceptron is used
throughout  this  research. 

2.3  The  Perceptron 

The  perceptron,  which  was  introduced  by  Rosenblatt  in  1959  [24],  is  the  simplest 
neural  network  and  is  the  basic  functional  unit  of  the  more  complex  multi-layer  perceptron 
that  will  be  used  in  Chapter  III.  The  single  perceptron  may  be  used  to  solve  two-class 
problems  in  which  the  class  data  is  linearly  separable.  This  section  covers  the  architecture 
and  operation  of  the  single  perceptron  and  provides  an  overview  of  the  perceptron  training 
algorithm. 

2.3.1  Architecture  and  Operation.  The  function  of  the  perceptron  (Figure  2.3)  is 
analogous to that of the biological neuron [23]. The perceptron computes a weighted sum
of its inputs.

Figure 2.3  Single perceptron.

More specifically, the perceptron multiplies each of its inputs, the input features $x_1 \ldots x_I$
and a bias $x_{I+1}$, by its respective weight $w_i$ and computes the sum. The sum may then
be transformed by an activation function, which forces the output between a high value
(typically 1) and a low value (typically -1 or 0). The bias is typically set equal to 1,
resulting in a bias term in the sum which is simply the weight $w_{I+1}$. This calculation may
be represented mathematically as

$$y = f\left(\sum_{i=1}^{I} w_i x_i + w_{I+1}\right) \tag{2.1}$$

or

$$y = f\left(\sum_{i=1}^{I+1} w_i x_i\right) \tag{2.2}$$

where $f(\cdot)$ is the activation function and $I+1$ is the number of input features including
the bias value of 1.

Figure 2.4  Activation functions.

While any number of functions may be used, the activation functions typically
associated with the single perceptron are shown in Figure 2.4. The mathematical formulations
of these functions are as follows:

linear:  $f(x) = x$
sigmoid:  $f(x) = \dfrac{1}{1 + e^{-x}}$
hyperbolic tangent:  $f(x) = \tanh(x)$
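
As a minimal sketch (not taken from the software used in this research), the weighted sum
of Equation 2.2 and the three activation functions may be written in plain Python. The
function names and the list-based weight layout are illustrative assumptions, and the
sigmoid is assumed to be the logistic function.

```python
import math


def linear(x):
    """Linear activation: the weighted sum is passed through unchanged."""
    return x


def sigmoid(x):
    """Logistic sigmoid (assumed form): output lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def perceptron_output(features, weights, activation):
    """Equation 2.2: append the bias input (fixed at 1), then apply f to the weighted sum."""
    inputs = list(features) + [1.0]  # the last input is the bias value of 1
    if len(inputs) != len(weights):
        raise ValueError("expected one weight per feature plus one bias weight")
    total = sum(w * x for w, x in zip(weights, inputs))
    return activation(total)
```

For the hyperbolic tangent, `math.tanh` may be passed directly as the `activation` argument.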

2.3.2  Training.  The single perceptron may be trained to classify input vectors
in a two-class problem. This training is accomplished by randomly initializing the weights
to some small value (usually in the range [-0.5, 0.5]), presenting class-labeled input vectors
to the perceptron, and adjusting the weights such that the output of the perceptron tends
toward 1 for members of class $\omega_1$ and toward 0 or -1, depending on the activation
function, for members of class $\omega_2$. The error between the perceptron output and the desired
perceptron output is used to accomplish this weight adjustment. Mathematically, this is
represented by

$$\mathbf{w}^{+} = \mathbf{w}^{-} - \eta \frac{\partial E}{\partial \mathbf{w}^{-}} \tag{2.3}$$

where $\mathbf{w}^{+}$ represents the adjusted weight vector, $\mathbf{w}^{-}$ is the previous weight vector, $\eta$ is a
weight adjustment factor commonly referred to as the learning rate, and $E$ is the error.
Derivation of the functional form of the weight update is done only for the multi-layer
perceptron (Appendix A). The learning rate is a value in the range (0,1). All of the
vectors in the training data set are repeatedly presented to the network, and the weights
are adjusted until a suitable level of error is reached. Each pass through the training data
set is known as an epoch.
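
The training procedure described above may be sketched as follows. This is an illustrative
example only: it assumes a sigmoid activation, a squared-error measure
$E = \frac{1}{2}(d - y)^2$, and a fixed number of epochs in place of an error threshold. The
function names, the default learning rate, and the logical AND data used to exercise it are
assumptions, not part of this research.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_perceptron(samples, targets, learning_rate=0.5, epochs=2000, seed=0):
    """Gradient-descent training of one sigmoid perceptron (sketch, assumed error measure)."""
    rng = random.Random(seed)
    n_inputs = len(samples[0]) + 1  # features plus the bias input
    # Random initial weights in the range [-0.5, 0.5].
    weights = [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    for _ in range(epochs):  # one epoch is one pass through the training set
        for x, d in zip(samples, targets):
            inputs = list(x) + [1.0]
            y = sigmoid(sum(w * xi for w, xi in zip(weights, inputs)))
            # For E = 0.5 * (d - y)**2, dE/dw_i = -(d - y) * y * (1 - y) * x_i.
            delta = (d - y) * y * (1.0 - y)
            weights = [w + learning_rate * delta * xi
                       for w, xi in zip(weights, inputs)]
    return weights


def classify(x, weights):
    """Assign class 1 if the perceptron output is at least 0.5, otherwise class 0."""
    inputs = list(x) + [1.0]
    y = sigmoid(sum(w * xi for w, xi in zip(weights, inputs)))
    return 1 if y >= 0.5 else 0
```

A linearly separable two-class problem such as the logical AND of two binary inputs is a
convenient check that the weights move toward a separating boundary.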

Although  the  perceptron  is  easy  to  implement  and  train  using  this  algorithm,  it  is 
limited  to  two-class  problems  in  which  the  data  are  linearly  separable.  To  overcome  this 
limitation, a more complex structure must be implemented: the multi-layer perceptron.

2.4  The  Multi-Layer  Perceptron 

The  multi-layer  perceptron  is  the  most  frequently  implemented  of  all  neural  network 
architectures.  Unlike  the  single  perceptron  which  only  draws  a  single  line  or  plane  in 
the  feature  space,  the  multi-layer  perceptron  is  capable  of  drawing  complex  surfaces  to 
separate  class  data.  The  remainder  of  this  section  provides  an  overview  of  the  architecture, 
operation,  and  training  of  the  multi-layer  perceptron. 

2.4.1  Architecture and Operation.  The multi-layer perceptron is, as the name
implies, nothing more than two or more layers of perceptrons, as shown in Figure 2.5.
Although multi-layer perceptrons may consist of more than two layers, it has been shown
that two layers are sufficient for any arbitrary classification problem [8].

Figure 2.5  Multi-layer perceptron.


Each perceptron, or node, in the first layer, often referred to as the hidden layer,
computes a weighted sum of the inputs (as described in Section 2.3.1) using its own weight
vector. The weight vectors of the hidden layer nodes comprise the weight matrix $\mathbf{w}_1$. This
operation is represented by

$$y_j = f_j\left(\sum_{i=1}^{I+1} w_{ij} x_i\right) \tag{2.4}$$

or, in matrix/vector notation,

$$\mathbf{y} = f_j(\mathbf{w}_1 \mathbf{x}') \tag{2.5}$$


where $\mathbf{x}'$ is the transpose of $\mathbf{x}$, and $f_j$ is the transformation function for the hidden layer
nodes. Similarly, each output layer node computes a weighted sum of the hidden layer
outputs and a bias term, $y_1 \ldots y_{J+1}$, producing the multi-layer perceptron output vector
$\mathbf{z}$. The weight vectors for the output layer nodes are given by $\mathbf{w}_2$. This operation is
represented by

$$z_k = f_k\left(\sum_{j=1}^{J+1} w_{jk} y_j\right) \tag{2.6}$$

or, in matrix/vector notation,

$$\mathbf{z} = f_k(\mathbf{w}_2 \mathbf{y}') \tag{2.7}$$

where $\mathbf{y}'$ is the transpose of $\mathbf{y}$, and $f_k$ is the transformation function for the output layer
nodes.
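
The two-layer computation of Equations 2.4 through 2.7 may be sketched in a few lines of
Python. The sketch assumes sigmoid activations in both layers by default and stores each
node's weights as one row, with the bias weight last; these layout choices and the function
names are illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_output(inputs, weight_rows, activation):
    """One layer of perceptrons: each row holds one node's weights, bias weight last."""
    extended = list(inputs) + [1.0]  # append the bias input of 1
    return [activation(sum(w * v for w, v in zip(row, extended)))
            for row in weight_rows]


def mlp_forward(x, w1, w2, hidden_activation=sigmoid, output_activation=sigmoid):
    """Return the hidden layer outputs y and the network outputs z."""
    y = layer_output(x, w1, hidden_activation)  # Equations 2.4 and 2.5
    z = layer_output(y, w2, output_activation)  # Equations 2.6 and 2.7
    return y, z
```

With all hidden weights set to zero, every hidden node outputs 0.5, which gives a simple
hand check of the output layer sum.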

The activation functions used with the single perceptron are also used in multi-layer
architectures, with all nodes in a layer employing the same function. For example,
all hidden layer nodes may use the sigmoid function while all output layer nodes use
a linear transformation (a sigmoid-linear network). Although many different combinations may be
used, common combinations of activation functions are sigmoid-sigmoid, sigmoid-linear,
tanh-tanh, and tanh-linear.

In designing a multi-layer perceptron for a particular problem, several parameters
must be chosen. Among them are the number of hidden layer nodes $J$, the number of
output layer nodes $K$, and the activation functions to be employed. While $K$ is usually set
to the number of classes in the problem so that each node represents a class, $J$ is somewhat
arbitrary. A common rule of thumb is that the number of hidden layer nodes should be
selected such that the number of samples is at least 10 times the number of weights in
the network [2]. Selection of architecture parameters for this research is covered in
Chapter III.
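
The rule of thumb above can be turned into a short calculation. The sketch below assumes a
two-layer network with one bias weight per node, so that the number of weights is
$(I+1)J + (J+1)K$; the function names are illustrative, and the sample count in the usage
note is hypothetical rather than taken from the data sets in this research.

```python
def weight_count(n_features, n_hidden, n_outputs):
    """Weights in a two-layer perceptron with one bias weight per node."""
    return (n_features + 1) * n_hidden + (n_hidden + 1) * n_outputs


def max_hidden_nodes(n_samples, n_features, n_outputs, samples_per_weight=10):
    """Largest J for which n_samples >= samples_per_weight * weight_count(...)."""
    budget = n_samples // samples_per_weight  # largest allowed number of weights
    # Solve J * (I + 1 + K) + K <= budget for the integer J.
    j = (budget - n_outputs) // (n_features + 1 + n_outputs)
    return max(j, 0)
```

For example, with 54 input features, 2 output nodes, and a hypothetical 3000 training
samples, `max_hidden_nodes(3000, 54, 2)` returns 5, since five hidden nodes require 287
weights while six would require 344.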

2.4.2  Training.  As with the single perceptron, training is accomplished by
initializing the weight matrices to some small value, presenting the class-labeled input vectors,
and adjusting the weight matrices until a suitable level of error is reached.

The multi-layer perceptron is most often trained using the backpropagation algorithm,
which was independently discovered by three different researchers [32, 20, 28]. Although
numerous efforts have been made to improve the backpropagation algorithm (backpropagation
with momentum [28], conjugate gradient [29], and Newton's method [3], to name a few), the
addition of momentum is the least difficult improvement to implement and is, therefore,
the most common. The complexity of the latter two algorithms and their intense
computation time often outweigh their improvement in generalization (ability to classify data which
was not used in training) and convergence time (number of epochs required to reach an
acceptable level of error) [1]. For this reason, backpropagation with momentum is used to
train the multi-layer perceptrons implemented throughout this research.

Backpropagation adjusts the weights based on the error between the desired output
of the network and the actual output for each input vector in the training data set. The
most common measure of this error is the sum-squared error, which is given by

$$E = \frac{1}{2}\sum_{k=1}^{K} (d_k - z_k)^2 \tag{2.8}$$

where $d_k$ is the desired output, and $z_k$ is the actual output.


Training may be conducted in one of two modes, instantaneous or batch. In
instantaneous training, the weights are adjusted after the presentation of each input vector. The
general weight update rule for such training is

$$W^{+} = W^{-} - \eta \frac{\partial E}{\partial W^{-}} \tag{2.9}$$

where $W^{+}$ is the updated weight matrix, $W^{-}$ is the old weight matrix, and $\eta$ is the
learning  rate.  The  learning  rate  is  a  variable  which  may  assume  values  between  zero  and 
one.  The  learning  rate  may  remain  constant  throughout  training  or  it  may  be  varied  in 
this  range  during  training.  Algorithms  which  vary  the  training  parameters  during  training 
are  known  as  adaptive  algorithms  [7].  When  performing  batch  training,  each  input  vector 
is  presented  and  the  update  portion  of  the  equation  is  computed  for  each  input  vector. 
However,  the  weights  are  not  updated  after  each  input  vector  presentation.  An  average 
update  is  computed  over  all  of  the  input  vectors,  and  the  weights  are  updated  at  the  end 
of  the  epoch.  In  batch  mode,  the  general  update  rule  becomes 


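$$W^{+} = W^{-} - \frac{\eta}{n}\sum_{i=1}^{n} \frac{\partial E_i}{\partial W^{-}} \tag{2.10}$$

where $n$ is the number of input vectors and $E_i$ is the error for the $i$th input vector. For
ease of notation, the remainder of the weight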
update  equations  in  this  chapter  are  shown  in  their  instantaneous  form.  These  equations 
may  be  transformed  to  their  batch  form  by  simply  averaging  the  update  portion  of  the 
equation  over  all  of  the  input  vectors. 
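
The difference between the two training modes can be seen on the smallest possible case. The
sketch below assumes a single linear unit with one weight and the error
$E = \frac{1}{2}(d - wx)^2$; the function names and the data in the usage note are
illustrative assumptions.

```python
def gradient(w, x, d):
    """dE/dw for a single linear unit with E = 0.5 * (d - w * x)**2."""
    return -(d - w * x) * x


def instantaneous_epoch(w, data, eta):
    """Update the weight after every input vector (Equation 2.9)."""
    for x, d in data:
        w = w - eta * gradient(w, x, d)
    return w


def batch_epoch(w, data, eta):
    """Average the gradient over all input vectors, then update once (Equation 2.10)."""
    average = sum(gradient(w, x, d) for x, d in data) / len(data)
    return w - eta * average
```

Starting from `w = 0.0` with `data = [(1.0, 2.0), (2.0, 4.0)]` and `eta = 0.1`, one batch
epoch moves the weight to 0.5, while one instantaneous epoch moves it to 0.92, because the
second update already uses the weight changed by the first.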

In either mode, backpropagation training is simply a gradient descent which seeks the
minimum in the weight space error surface by adjusting the weights in a negative gradient
direction. Figure 2.6 shows an error surface in a one-dimensional weight space, while
Figure 2.7 illustrates an error surface in a two-dimensional weight space. Note that as the
number of weights increases beyond two, which is the case for any problem of significance,
the error surface becomes impossible to visualize.


Figure 2.6  Error surface in one-dimensional weight space.

Figure 2.7  Error surface in two-dimensional weight space.


It is desirable for the algorithm to adjust the weights such that the most direct path
to the error surface minimum is taken. However, the error surface descent is dependent on
the learning rate $\eta$. As a result, the algorithm may skip back and forth over the minimum,
increasing the time required to converge. This oscillation may be damped by adding
a momentum term, which is a constant $\alpha$ multiplied by the previous weight update $\Delta W$.
The momentum may be constant during training or may be adapted during training like
the learning rate.

$$W^{+} = W^{-} - \eta \frac{\partial E}{\partial W^{-}} + \alpha \Delta W \tag{2.11}$$

This has the effect of increasing the weight update when moving down the error surface or
decreasing the update when moving up the error surface. This improves backpropagation
learning by allowing a smoother and more timely descent on the error surface.
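
A minimal sketch of the momentum update for a single weight is given below. The function
names and the default values of the learning rate and momentum constant are illustrative
assumptions; the previous weight update is carried from one step to the next.

```python
def momentum_step(w, grad, prev_update, eta=0.1, alpha=0.9):
    """One update: negative gradient step plus alpha times the previous update."""
    update = -eta * grad + alpha * prev_update
    return w + update, update


def descend(w, grad_fn, steps, eta=0.1, alpha=0.9):
    """Repeated momentum updates, starting with no previous update."""
    update = 0.0
    for _ in range(steps):
        w, update = momentum_step(w, grad_fn(w), update, eta, alpha)
    return w
```

On the error surface $E = \frac{1}{2}w^2$, whose gradient is $w$, two steps from $w = 1$ give
0.9 and then 0.72; without momentum the second step would reach only 0.81, which illustrates
the larger updates taken while the descent continues in the same direction.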

The  complexity  of  the  backpropagation  algorithm  comes  in  deriving  the  derivative 
of  the  error  with  respect  to  the  weights  for  the  hidden  layer  and  the  output  layer  to  arrive 
at a functional form of the weight update rule. The generic functional form of the output
layer weight update is

$$w^{+}_{j_0 k_0} = w^{-}_{j_0 k_0} + \eta\,(d_{k_0} - f_{k_0})\, f'_{k_0}\, f_{j_0} \tag{2.12}$$

while the hidden layer update is given by

$$w^{+}_{i_0 j_0} = w^{-}_{i_0 j_0} + \eta \sum_{k=1}^{K} (d_k - f_k)\, f'_k\, w_{j_0 k}\, f'_{j_0}\, x_{i_0} \tag{2.13}$$

Note that the outputs of the hidden layer and the output layer are denoted by the
transformation functions $f_j$ and $f_k$, respectively; $f'_{k_0}$ is the derivative of the activation function
evaluated at node $k_0$. The general derivations of the hidden layer and output layer updates
are given in Appendix A along with the transformation function specific derivations. The
momentum term (not shown here or in the derivations) is simply the difference between
the weight matrix from the previous input presentation and the current weight matrix
multiplied by $\alpha$.
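
The output layer and hidden layer updates may be sketched for the sigmoid-sigmoid case, for
which the derivative of the activation function is $f' = f(1 - f)$. The sketch assumes the
error $E = \frac{1}{2}\sum_k (d_k - z_k)^2$, instantaneous training without momentum, and
weights stored one row per node with the bias weight last; the function names are
illustrative and do not reproduce the derivations of Appendix A.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w1, w2):
    """Return the bias-extended input, bias-extended hidden outputs, and network outputs."""
    xe = list(x) + [1.0]
    y = [sigmoid(sum(w * v for w, v in zip(row, xe))) for row in w1]
    ye = y + [1.0]
    z = [sigmoid(sum(w * v for w, v in zip(row, ye))) for row in w2]
    return xe, ye, z


def sse(x, d, w1, w2):
    """Sum-squared error (with the assumed factor of one half) for one input vector."""
    _, _, z = forward(x, w1, w2)
    return 0.5 * sum((dk - zk) ** 2 for dk, zk in zip(d, z))


def backprop_step(x, d, w1, w2, eta=0.5):
    """One instantaneous update of both weight matrices; returns new matrices."""
    xe, ye, z = forward(x, w1, w2)
    # Output layer term (d_k - f_k) * f'_k.
    delta_k = [(dk - zk) * zk * (1.0 - zk) for dk, zk in zip(d, z)]
    # Hidden layer term: sum over k of delta_k * w_jk, multiplied by f'_j.
    delta_j = [
        sum(delta_k[k] * w2[k][j] for k in range(len(w2))) * ye[j] * (1.0 - ye[j])
        for j in range(len(w1))
    ]
    new_w2 = [[w2[k][j] + eta * delta_k[k] * ye[j] for j in range(len(ye))]
              for k in range(len(w2))]
    new_w1 = [[w1[j][i] + eta * delta_j[j] * xe[i] for i in range(len(xe))]
              for j in range(len(w1))]
    return new_w1, new_w2
```

A useful check on any such implementation is to compare each weight change with a numerical
estimate of $-\eta\,\partial E/\partial w$ obtained by perturbing that weight slightly.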

2.5  Feature  Selection 

One  of  the  most  important  tasks  in  classifier  design  is  selecting  the  appropriate 
feature set for classification. This is critical for several reasons. According to the "Curse
of Dimensionality" posited by Foley, a larger number of input features requires a larger
number  of  training  samples  [10].  In  addition,  a  larger  feature  set  increases  the  number  of 
weights  that  must  be  used  when  employing  neural  network  classifiers.  Both  a  larger  training 
set  and  an  increased  number  of  weights  lead  to  longer  training  times.  Finally,  Kabrisky 
has  suggested  that  there  is  an  optimal  number  of  features  in  terms  of  error  for  some  data 
sets  [13].  Ultimately,  the  goal  of  feature  selection  is  to  reduce  the  feature  set  without  a 
significant  decrease,  if  any,  in  the  accuracy.  Many  techniques  exist  for  determining  the 
saliency  (importance  for  classification)  of  input  features  [27,  21,  30,  21]  and,  subsequently, 
for  reducing  the  feature  set.  The  method  used  for  feature  reduction  in  this  research  is 
forward  sequential  selection. 

Forward  sequential  selection  is  commonly  used  to  rank  features  according  to  their 
saliency. This technique begins by training a classifier for each input feature from the set
of candidate features. The feature which results in the highest classification accuracy is
added to a feature nucleus, which is initially empty, and the feature is removed from the
set  of  candidate  features.  Subsequently,  a  classifier  is  trained  using  the  nucleus  and  each 
feature  from  the  candidate  set.  Again,  the  feature  whose  addition  to  the  nucleus  results 
in  the  best  classification  accuracy  is  added  to  the  nucleus  and  removed  from  the  candidate 
set.  This  process  is  repeated  until  the  desired  number  of  features  have  been  ranked  or  the 
candidate  feature  set  is  empty. 
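
The ranking loop of forward sequential selection may be sketched as follows. The callable
`score` is a hypothetical stand-in for training and testing a classifier on a given feature
subset and returning its classification accuracy; it and the other names are assumptions
made for illustration.

```python
def forward_sequential_selection(candidates, score, n_select=None):
    """Rank features by repeatedly adding the one that gives the best score.

    candidates: iterable of feature identifiers.
    score: callable taking a list of features and returning an accuracy (assumed).
    n_select: number of features to rank; all candidates if None.
    """
    remaining = list(candidates)
    nucleus = []   # the feature nucleus, initially empty
    ranking = []   # (feature, accuracy of the nucleus after adding it)
    limit = len(remaining) if n_select is None else min(n_select, len(remaining))
    while len(nucleus) < limit:
        # Evaluate the nucleus plus each remaining candidate and keep the best.
        best = max(remaining, key=lambda f: score(nucleus + [f]))
        nucleus.append(best)
        remaining.remove(best)
        ranking.append((best, score(nucleus)))
    return ranking
```

Because a classifier is trained for every candidate at every step, ranking all $N$ features
requires on the order of $N^2/2$ training runs, which is worth bearing in mind for large
feature sets.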


2.6  Introduction  to  Bayes  Error  Bounding 

Bayes error is the minimum average probability of error associated with any
classification problem [9]. This means that Bayes error represents the best performance that a
classifier may achieve on average. As such, it is the benchmark of performance for
statistical classifiers, and any classifier with an error rate approaching Bayes error is considered
to be Bayes optimal. Ruck et al. have shown that a multi-layer perceptron
trained using sum-squared error as the error measurement approximates Bayes
discrimination [27]. Thus, Bayes error is also used to evaluate the performance of the multi-layer
perceptron classifier.

The  remainder  of  this  section  discusses  Bayes  Decision  Theory  to  the  extent  necessary 
to  understand  Bayes  error  and  describes  methods  for  estimating  Bayes  error. 


2.6.1  Bayes Decision Theory.  The stated purpose of pattern classifiers is to
determine objects' classes, $\omega$, based on some feature or set of features, $x$, such that the
probability of misclassification (error) is minimum. According to Bayes Decision Theory,
the minimum probability of error is achieved by selecting the class $\omega_i$ which has the
highest probability given the measurement $x$, $P(\omega_i \mid x)$. According to Bayes theorem, the a
posteriori probability $P(\omega_i \mid x)$ is given by

$$P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\, P(\omega_i)}{p(x)} \tag{2.14}$$

where $p(x \mid \omega_i)$ is the class conditional probability of $x$, $P(\omega_i)$ is the a priori probability of class
$\omega_i$, and $p(x)$ is the probability density function (pdf) for the measurement $x$. The class
conditional probability of $x$ is the probability of observing values of $x$ given a specific class
$\omega_i$, and the a priori probability of class $\omega_i$ represents the proportion of the total number
of objects in the sample set which belong to class $\omega_i$. As an example, consider a two-class
problem in one dimension. Bayes decision rule becomes

select $\omega_1$ if $P(\omega_1 \mid x) > P(\omega_2 \mid x)$; otherwise select $\omega_2$

As shown in Figure 2.8a, Bayes decision rule produces a boundary at the intersection of the two distributions, which minimizes the area under the two curves that falls on the wrong side of the boundary (the error). Any other boundary produces a higher probability of error, as shown in Figure 2.8b.


Figure 2.8 Probability distributions with decision boundaries: (a) optimal boundary set by Bayes decision rule; (b) non-optimal boundary.


As stated, there is some probability of error associated with selecting the class $\omega_i$ given $x$. Because error is defined as choosing the wrong state of nature $\omega$, the probability of error is the sum of the a posteriori probabilities of the states of nature not chosen. For example, in a two-class problem, the probability of error given $x$ is

$$P(\mathrm{error} \mid x) = \begin{cases} P(\omega_1 \mid x), & \text{if } \omega_2 \text{ is chosen;} \\ P(\omega_2 \mid x), & \text{if } \omega_1 \text{ is chosen.} \end{cases}$$

The average probability of error is calculated over the entire range of $x$ as follows:

$$P(\mathrm{error}) = \int_{-\infty}^{+\infty} P(\mathrm{error} \mid x)\,p(x)\,dx \tag{2.15}$$

This represents Bayes error when the decision boundary is chosen according to Bayes decision rule.
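
The average probability of error can be checked numerically for a simple case. The following Python sketch is an illustration only and is not part of the MATLAB code used in this research. It assumes two one-dimensional Gaussian classes with equal a priori probabilities, means of 0 and 2, and unit standard deviations; these values, and the function names, are hypothetical. Under these assumptions the Bayes boundary lies at x = 1, and moving the boundary elsewhere increases the integrated error.

```python
import math

def gaussian_pdf(x, mean, std):
    """Class conditional density p(x | class) for a Gaussian class."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def average_error(boundary, priors, means, stds, lo=-20.0, hi=20.0, steps=40000):
    """Integrate P(error|x) p(x) dx for the rule 'choose class 1 if x < boundary'."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx  # midpoint rule
        joint1 = priors[0] * gaussian_pdf(x, means[0], stds[0])
        joint2 = priors[1] * gaussian_pdf(x, means[1], stds[1])
        # P(error|x) p(x) equals the joint density of the class that is NOT chosen.
        total += (joint2 if x < boundary else joint1) * dx
    return total

# Assumed toy problem: equal priors, means 0 and 2, unit standard deviations.
priors, means, stds = (0.5, 0.5), (0.0, 2.0), (1.0, 1.0)
bayes_error = average_error(1.0, priors, means, stds)    # boundary at the intersection
shifted_error = average_error(1.8, priors, means, stds)  # a non-optimal boundary
print(round(bayes_error, 4), round(shifted_error, 4))
```

In this toy problem the boundary at the intersection of the two weighted densities gives the smaller value, which mirrors the comparison between Figures 2.8a and 2.8b.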

2.6.2 Bounding Bayes Error. Because the sample set for any classification problem is finite and the underlying probability distributions are seldom known, Bayes error cannot be calculated directly using the mathematical formulations given. Instead, it may be estimated by calculating upper and lower bounds on the error. Any classifier whose error rate falls within these bounds is considered to be at optimal performance for the given problem. The most common error bounding technique employs the Leave-One-Out method to estimate the upper bound and the Resubstitution method to estimate the lower bound [11, 18].


Figure 2.9 Bayes error bounding: Resubstitution and Leave-One-Out.

2.6.2.1 Leave-One-Out. In the Leave-One-Out method, the entire data set minus one sample is used to train a classifier and the remaining sample is then used to test the classifier. This process is repeated until each sample has been left out and used to test the classifier. The error is the proportion of the total number of test samples misclassified. Error estimates are performed in this manner over a range of a classifier parameter, such as the number of hidden layer nodes. This produces an upper bound for Bayes error as shown in Figure 2.9.

2.6.2.2 Resubstitution. Using the Resubstitution method, the entire data set is used for both training and testing the classifier. The error is, again, the proportion of test samples misclassified, and the error estimates are calculated for a range of classifier parameter values. The Resubstitution method produces a lower bound as shown in Figure 2.9.
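
The two estimates can be sketched in a few lines of Python. This is a minimal illustration, not the MATLAB code of this research: a nearest-class-mean classifier stands in for the multi-layer perceptron so that the example stays short, and the function names and the six-sample toy data set are hypothetical.

```python
def train_nearest_mean(samples):
    """samples: list of (label, feature_vector). Returns {label: mean_vector}."""
    sums, counts = {}, {}
    for label, x in samples:
        acc = sums.setdefault(label, [0.0] * len(x))
        for d, v in enumerate(x):
            acc[d] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def classify(means, x):
    """Assign x to the class whose mean is closest (squared Euclidean distance)."""
    def sqdist(m):
        return sum((a - b) ** 2 for a, b in zip(m, x))
    return min(means, key=lambda c: sqdist(means[c]))

def resubstitution_error(samples):
    """Train and test on the entire data set."""
    means = train_nearest_mean(samples)
    wrong = sum(1 for label, x in samples if classify(means, x) != label)
    return wrong / len(samples)

def leave_one_out_error(samples):
    """Train on all samples but one, test on the one left out, and repeat."""
    wrong = 0
    for i, (label, x) in enumerate(samples):
        means = train_nearest_mean(samples[:i] + samples[i + 1:])
        if classify(means, x) != label:
            wrong += 1
    return wrong / len(samples)

# Hypothetical one-dimensional, two-class toy data.
toy = [(1, [0.0]), (1, [1.0]), (1, [2.9]), (2, [3.0]), (2, [4.0]), (2, [2.2])]
resub = resubstitution_error(toy)
loo = leave_one_out_error(toy)
print(resub, loo)
```

On this toy set the Leave-One-Out estimate is larger than the Resubstitution estimate, which is the ordering that the bounding technique relies on.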

2.7 Summary

This chapter provides the theoretical basis for the methods used throughout this research effort. The use of neural networks for pattern recognition problems is discussed with particular attention given to the multi-layer perceptron. Furthermore, the importance of properly selecting the learning rate and the momentum constant for multi-layer perceptron classifiers and the need to reduce the input feature set in order to achieve a suitable classifier are stressed. Finally, the evaluation of such a classifier using Bayes error is presented.


III.  Methods  &  Results 


3.1 Introduction

This chapter presents the methodology used to design and evaluate a multi-layer perceptron to classify the given environmental data sets. Because this methodology is sequential and some of the steps in the process require results from the previous step, the intermediate results and the final results are also presented in this chapter. All techniques discussed are based on the theory presented in Chapter II and are implemented in MATLAB®. Appendix B contains all MATLAB® functions employed in this research.

3.2 Data Description

This thesis demonstrates the classifier design methodology using two data sets which were provided by AFTAC. The first data set consists of measurements of three classes of stainless steel: ss304, ss316, and ss400. The input features for the stainless steel data (shown in Table 3.1) are the percent by atom of six radioisotopes, the percent by weight of the elements from Lithium to Californium, and the associated measurement errors.


Radioisotopes               % by Atom      Measurement Error
Uranium 234                 U234P          U234E
Uranium 235                 U235P          U235E
Uranium 236                 U236P          U236E
Plutonium 240               Pu240P         Pu240E
Plutonium 241               Pu241P         Pu241E
Plutonium 242               Pu242P         Pu242E

Elements                    % by Weight    Measurement Error
Lithium thru Californium    X              XE

Table 3.1 Stainless steel features. Note: X is the elemental symbol for the elements between Lithium and Californium inclusive in the periodic table; a full listing of these elements and symbols is provided in Appendix C.

This results in a 196-dimensional feature set. The features for the actinide data consist only of the radioactive particle percentages and the associated measurement errors (Table 3.2), resulting in an 18-dimensional feature set. Unlike the three-class stainless steel data, the actinide class membership may be determined in one of two ways. Due to the sensitive nature of AFTAC's mission, the samples are labeled by generic, hyphenated alphanumeric descriptors (A-1, A-S, B-1, C-1, ..., M-1) which represent the sample origin. Allowing each descriptor to represent a different class results in 55 classes. However, allowing the class structure to be based solely on the alphabetic part of the descriptors (A, B, C, ..., M) results in 15 classes. This research performs classification using both class structures. Therefore, essentially three data sets are used for classification: a stainless steel data set and two actinide data sets.


Radioisotopes    % by Atom    Measurement Error    % Measurement Error
Uranium 234      U234P        U234E                U234PE
Uranium 235      U235P        U235E                U235PE
Uranium 236      U236P        U236E                U236PE
Plutonium 240    Pu240P       Pu240E               Pu240PE
Plutonium 241    Pu241P       Pu241E               Pu241PE
Plutonium 242    Pu242P       Pu242E               Pu242PE

Table 3.2 Actinide features.


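The data is configured in matrix form such that the first column contains an integer class designator, the remaining columns represent input features, and each row represents an individual sample. The configurations for each data set are given in Equations 3.1-3.3. In Equation 3.1, classes one, two, and three represent ss304, ss316, and ss400, respectively. The full set of alphanumeric descriptors and the associated integer class values are given in Appendix E.

$$\mathrm{Steel} = \left[\begin{array}{ccccc}
\text{Class} & \text{Feature 1} & \text{Feature 2} & \cdots & \text{Feature 196} \\
1 & \mathrm{U234P} & \mathrm{U235P} & \cdots & \mathrm{CfE} \\
1 & \mathrm{U234P} & \mathrm{U235P} & \cdots & \mathrm{CfE} \\
\vdots & \vdots & \vdots & & \vdots \\
3 & \mathrm{U234P} & \mathrm{U235P} & \cdots & \mathrm{CfE} \\
3 & \mathrm{U234P} & \mathrm{U235P} & \cdots & \mathrm{CfE}
\end{array}\right] \tag{3.1}$$

$$\mathrm{Actinide1} = \left[\begin{array}{ccccc}
\text{Class} & \text{Feature 1} & \text{Feature 2} & \cdots & \text{Feature 18} \\
1 & \mathrm{U234P} & \mathrm{U234E} & \cdots & \mathrm{Pu242PE} \\
1 & \mathrm{U234P} & \mathrm{U234E} & \cdots & \mathrm{Pu242PE} \\
\vdots & \vdots & \vdots & & \vdots \\
55 & \mathrm{U234P} & \mathrm{U234E} & \cdots & \mathrm{Pu242PE} \\
55 & \mathrm{U234P} & \mathrm{U234E} & \cdots & \mathrm{Pu242PE}
\end{array}\right] \tag{3.2}$$

$$\mathrm{Actinide2} = \left[\begin{array}{ccccc}
\text{Class} & \text{Feature 1} & \text{Feature 2} & \cdots & \text{Feature 18} \\
1 & \mathrm{U234P} & \mathrm{U234E} & \cdots & \mathrm{Pu242PE} \\
1 & \mathrm{U234P} & \mathrm{U234E} & \cdots & \mathrm{Pu242PE} \\
\vdots & \vdots & \vdots & & \vdots \\
15 & \mathrm{U234P} & \mathrm{U234E} & \cdots & \mathrm{Pu242PE} \\
15 & \mathrm{U234P} & \mathrm{U234E} & \cdots & \mathrm{Pu242PE}
\end{array}\right] \tag{3.3}$$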


3.3 Process Overview

For each of the data sets described, the process illustrated in Figure 3.1 is used to develop a neural network classifier. The data is first preprocessed and the neural network architecture is selected. The parameters for the selected architecture and training algorithm are chosen. The network is trained, the feature set is reduced, and the classifier is trained again using this reduced set. Finally, the performance of the classifier is evaluated by comparing the accuracy of the classifier to the Bayes error bounds. The following sections provide a thorough treatment of each stage of the process.

Figure 3.1 Process overview.


3.4 Data Preparation

The data provided by AFTAC is preprocessed for two reasons. First, the data contains gaps where values for features in some samples were not given. Two methods of dealing with these data gaps are considered: (1) omit any samples which contain gaps or (2) replace missing values with zeros. Because there are many gaps in the data and because omitting entire samples means losing potentially useful data, the second option is employed. Second, the data is preprocessed to remove any features for which all values are identical; these homogeneous features are useless in classification (MATLAB® function removeh.m, Section B.1). This results in a stainless steel data set which contains only the 50 features shown in Table 3.3. Unlike the steel data, the actinide data set contains no homogeneous features and, therefore, remains unchanged.

After removing the homogeneous features, the remaining input features are normalized using the following simple normalization

$$\frac{\mathrm{data}_{ij} - \mu_j}{\sigma_j} \tag{3.4}$$

where $i$ represents the sample, $j$ represents the feature, $\mu_j$ is the mean of feature $j$, and $\sigma_j$ is the standard deviation of feature $j$ (MATLAB® function normal.m, Section B.2). This is done to prevent features of high magnitude from disproportionately affecting the weight update (Equation 2.9) during training.
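
The three preprocessing steps can be sketched as follows. This Python fragment is an illustration under stated assumptions and is not a transcription of removeh.m or normal.m: missing values are assumed to be represented by None, the class column is assumed to have been set aside already, and the population form of the standard deviation is assumed because the source does not state which form is used. All function names are hypothetical.

```python
import math

def fill_gaps(rows):
    """Replace missing feature values (represented here by None) with zeros."""
    return [[0.0 if v is None else float(v) for v in row] for row in rows]

def remove_homogeneous(rows):
    """Drop every feature (column) whose value is identical in all samples."""
    keep = [j for j in range(len(rows[0]))
            if any(row[j] != rows[0][j] for row in rows)]
    return [[row[j] for j in keep] for row in rows], keep

def normalize(rows):
    """Subtract each feature's mean and divide by its standard deviation."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # Population standard deviation (divide by n); this choice is an assumption.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical feature matrix: three samples, three features, one gap.
raw = [[1.0, 5.0, None], [3.0, 5.0, 4.0], [5.0, 5.0, 8.0]]
filled = fill_gaps(raw)
reduced, kept = remove_homogeneous(filled)  # the constant middle feature is dropped
scaled = normalize(reduced)
print(kept, scaled)
```

Homogeneous features are removed before normalization because their standard deviation is zero, which would otherwise cause a division by zero.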


Features

U234P   U235P   U236P   U234E   U235E   U236E   Pu240P  Pu240E
C       O       F       Na      Mg      Al      Si      P
S       Cl      K       Ca      Ti      V       Cr      Mn
Fe      Co      Ni      Cu      Zn      Zr      Mo      Ru
Rh      Ag      Cd      Sb      Nd      Sm      Gd      W
Pb      Bi      Th      U       Pu      OE      CrE     FeE

Table 3.3 Steel: feature set after homogeneous feature removal.


3.5 Architecture Parameter Selection

For reasons discussed in Section 2.4, the multi-layer perceptron is used for this research effort. Because the number of input features, I, and the number of output nodes, K, are determined by the data set, the only free architecture parameters are the number of hidden layer nodes, J, and the type of activation functions to be used.

A larger number of hidden layer nodes allows decision regions of higher complexity to be drawn in the feature space but also diminishes the generalization capability of the multi-layer perceptron. This loss of generalization is due to the memorization of the training data allowed by a large number of hidden nodes. To ensure good generalization, the number of training samples, n, in the data set should be at least ten times the number of weights in the network [2, 18]. Using this rule-of-thumb, the following upper bound for J may be derived:

$$J < \frac{\dfrac{n}{10} - K}{I + K + 1} \tag{3.5}$$


Table  3.4  shows  the  suggested  J  values  using  this  rule-of-thumb  along  with  the  parameters 
actually  used  in  implementing  the  classifiers.  Because  the  suggested  values  for  J  are  low, 
a  value  of  five  is  chosen  arbitrarily  for  each  data  set  with  the  intention  to  show,  through 
demonstration,  that  good  generalization  and  reasonable  training  times  are  possible  using 
this  number  of  hidden  layer  nodes. 
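
The bound follows from counting the weights. Assuming one bias weight per hidden node and per output node, an I-J-K network has (I + 1)J + (J + 1)K = J(I + K + 1) + K weights, which agrees with the weight counts listed in Table 3.4; requiring n to be at least ten times this number and solving for J gives the bound. The short Python sketch below reproduces the arithmetic; the function names are hypothetical and the code is an illustration rather than part of the thesis software.

```python
def weight_count(I, J, K):
    """Weights in an I-J-K multi-layer perceptron with one bias per hidden and output node."""
    return (I + 1) * J + (J + 1) * K

def suggested_hidden_nodes(n, I, K):
    """Upper bound on J from requiring n >= 10 * weight_count(I, J, K)."""
    return (n / 10.0 - K) / (I + K + 1)

# (n, I, K) for each data set, taken from Table 3.4.
data_sets = {"Steel": (473, 50, 3), "Actinide1": (1151, 18, 55), "Actinide2": (1151, 18, 15)}
for name, (n, I, K) in data_sets.items():
    print(name, round(suggested_hidden_nodes(n, I, K), 2), weight_count(I, 5, K))
```

With J = 5 the weight counts exceed n/10 for all three data sets, which is why the choice of five hidden nodes is described as arbitrary rather than derived from the rule-of-thumb.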


Data Set           I     K     n       Suggested J    Actual J    Number of Weights
Stainless Steel    50    3     473     < 1            5           273
Actinide1          18    55    1151    < 1            5           425
Actinide2          18    15    1151    2              5           185

Table 3.4 Multi-layer perceptron architecture parameters.


In addition to selecting the number of hidden layer nodes, the activation functions must be selected for both the hidden layer and the output layer. Although it has been suggested that the Hyperbolic Tangent has properties which may make it more desirable for training [14], the activation functions for all classifiers in this research are chosen to be Sigmoid-Sigmoid. That is, both the hidden layer nodes and the output layer nodes use the Sigmoid transformation function. AFIT has successfully implemented the Sigmoid-Sigmoid combination on other data sets [25].

3.6 Backpropagation Parameter Selection

The backpropagation with momentum algorithm is used to train all multi-layer perceptrons in this research. The only parameters for which values must be chosen when using this method are the learning rate, $\eta$, and the momentum, $\alpha$. Using standard backpropagation with momentum, learning rate and momentum remain constant throughout training. Because the quality of learning using this algorithm is highly dependent on these parameters, it is critical to properly select the learning rate and momentum values.
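
The role of the two parameters can be illustrated with a minimal Python sketch. It assumes the common form of the momentum update, in which each weight change is the learning rate times the negative error gradient plus the momentum times the previous weight change; the exact form used in this research is the weight update of Chapter II, which is not reproduced here. The one-dimensional quadratic error surface and the function name are hypothetical.

```python
def train_with_momentum(gradient, w0, eta, alpha, epochs):
    """Gradient descent with a momentum term; returns the weight after each epoch."""
    w, step = w0, 0.0
    history = []
    for _ in range(epochs):
        # New step = learning-rate-scaled downhill move + a fraction of the previous step.
        step = -eta * gradient(w) + alpha * step
        w += step
        history.append(w)
    return history

# Toy error surface E(w) = (w - 3)^2, whose gradient is 2 (w - 3); the minimum is at w = 3.
def toy_gradient(w):
    return 2.0 * (w - 3.0)

path = train_with_momentum(toy_gradient, w0=0.0, eta=0.1, alpha=0.1, epochs=200)
print(path[0], path[-1])
```

Both parameters stay fixed for the whole run, as in standard backpropagation with momentum; only the weight and the stored previous step change from epoch to epoch.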

Fallside  suggests  training  several  classifiers  using  different  combinations  of  learning 
rate  and  momentum  values  and  selecting  the  parameter  values  which  give  rise  to  the 
smoothest  learning  curve  (Classification  Error  versus  Epoch)  with  the  smallest  error  [7]. 
In  general,  a  smoother  learning  curve  indicates  a  more  direct  descent  on  the  weight  space 
error  surface  and  it  suggests  more  stable  learning,  which  is  less  sensitive  to  the  initial  values 
of the weights. By plotting the learning curves of the different learning rate-momentum
combinations  on  a  single  graph,  the  parameters  which  give  rise  to  the  best  learning  may 
be  selected  manually.  These  collections  of  learning  curves  are  referred  to  collectively  as 
Fallside  plots. 

For each of the data sets, learning rate and momentum are allowed to vary from .1 to .9 in .2 increments, producing twenty-five combinations of learning rate and momentum. Twenty-five classifiers, one for each combination, are trained for each data set; all training is performed over 1000 epochs (MATLAB® function fallside.m, Section B.4). The data sets are not divided into training and test sets; rather, the entire data sets are used for training each combination. One learning curve is produced for each combination of learning rate and momentum. In order to compare these twenty-five traces, five plots consisting of five traces each are produced for each data set (MATLAB® function postf.m, Section B.5). For clarity, only the best and worst learning curves for each data set are shown here (Figures 3.2-3.4). The full set of Fallside plots for each data set is shown in Appendix D.
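
The parameter grid itself is easy to enumerate, as the Python sketch below shows. In this research the best curve is selected manually from the plots; the ranking function in the sketch (lowest final error, with ties broken by the total epoch-to-epoch change) is only a hypothetical stand-in for that visual judgment, and the three learning curves are invented for illustration. None of this code corresponds to fallside.m or postf.m.

```python
from itertools import product

def parameter_grid(start=0.1, stop=0.9, step=0.2):
    """All (learning rate, momentum) pairs on the grid .1, .3, .5, .7, .9."""
    count = int(round((stop - start) / step)) + 1
    values = [round(start + k * step, 10) for k in range(count)]
    return list(product(values, values))

def roughness(curve):
    """Total epoch-to-epoch change in error; smaller means a smoother curve."""
    return sum(abs(b - a) for a, b in zip(curve, curve[1:]))

def rank_curves(curves):
    """Order parameter pairs by final error, breaking ties by roughness."""
    return sorted(curves, key=lambda p: (curves[p][-1], roughness(curves[p])))

grid = parameter_grid()
# Hypothetical learning curves (classification error per epoch) for three grid points.
curves = {
    (0.1, 0.1): [0.40, 0.20, 0.05, 0.01],
    (0.3, 0.9): [0.40, 0.15, 0.08, 0.08],
    (0.9, 0.9): [0.40, 0.45, 0.30, 0.45],
}
best = rank_curves(curves)[0]
print(len(grid), best)
```

A single numerical score cannot capture everything visible in a plot, such as late oscillation after an early minimum, which is one reason the manual inspection described above remains useful.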

3.6.1 Steel Results. For the Steel data set (Figure 3.2), training with $\eta = \alpha = .1$ is clearly better than training with $\eta = .3$ and $\alpha = .9$. Although the latter curve contains very little oscillation, it levels off at epoch 397 with a classification error of 8.2%. This classification error may indeed decline again if the network is trained for more epochs (200,000 epochs, for example), but there is no reason to select these parameter values because the best case parameter values ($\eta = \alpha = .1$) produce a 0.63% error within the first 1000 epochs. The oscillations around epoch 200 and between epochs 600 and 800 are of little consequence with respect to the overall learning of the classifier using these parameters. Consequently, the best case parameters, $\eta = \alpha = .1$, are used for subsequent classifiers for the Steel data set.


Figure  3.2  Steel:  learning  curves  for  the  best  and  worst  combinations  of  learning  rate 
and  momentum. 

It should be noted that the parameter selection process may be repeated using a refined range of values such as (0, .1). The accuracy will be no better, on average, if the error rates are already within the Bayes error bounds. It may, however, provide some improvement in training times.

3.6.2 Actinide1 Results (55-class). Unlike the best Steel learning curve, which approaches .60% classification error, none of the Actinide1 classifiers (Figure 3.3) performs better than 60% error. The worst learning curve ($\eta = .9$, $\alpha = .3$) reaches a minimum of 86% error in under 200 epochs, while the best learning curve ($\eta = .1$, $\alpha = .5$) reaches 60% error by epoch 800, at which point it levels off and begins oscillating. Although this level of error is not acceptable for a final classifier, the goal here is to select the best combination of parameters to continue in the design of the classifier. The overall performance of the classifier is evaluated in later analysis. Therefore, a learning rate of .1 and a momentum of .5 are selected for subsequent classifiers for this data set.


Figure 3.3 Actinide1 (55-class): learning curves for the best and worst combinations of learning rate and momentum.

3.6.3 Actinide2 Results (15-class). With the 15-class Actinide2 data set, the error rates achieved by all of the classifiers are lower than those of the Actinide1 classifiers. The best learning curve ($\eta = \alpha = .1$) approaches 20% error but oscillates heavily throughout training. Though this is not usually desirable, all of the Fallside traces which give comparable error values also oscillate heavily (see Appendix D). The poorest learning occurs with $\eta = \alpha = .9$, in which the error drops immediately to 45%, where it remains throughout the remainder of the training. Based on these plots, the learning rate and the momentum are both set to .1 for further classifier training.

Figure 3.4 Actinide2 (15-class): learning curves for three combinations of learning rate and momentum.

3.7  Initial  Training 

Once the parameters for each data set are selected, initial training is performed. This training provides an initial indication of the performance of the classifiers and allows the selection of appropriate training times. By training a multi-layer perceptron for a large number of epochs, the training data may be memorized. That is, the network can classify the entire training set with 100% accuracy. Initially this may seem to be a desirable outcome; however, when this is done the network loses its ability to generalize (classify samples which were not used in training). The number of training epochs should be selected at the point where this loss of generalization begins to occur.

The method used to determine the point at which the network begins to lose its generalization ability is to divide the data set into two sets (training and test) and track the classification error for the test set while training the network with the training set. After each epoch of training, each sample in the test set is classified using the current set of weights and the percentage of samples misclassified is stored (MATLAB® function tev.m, Section B.8). The number of training epochs is chosen where the test set error begins to increase.

As  noted  above,  the  data  sets  are  each  divided  into  two  subsets,  training  and  testing 
(MATLAB®  function  randchoose.m,  Section  B.3).  The  training  subset  of  each  data  set 
contains  two  thirds  of  the  samples  by  class,  while  the  test  set  contains  the  remaining  third 
by  class.  This  splitting  of  the  data  sets  for  training  and  testing  presents  a  slight  problem  in 
the  case  of  the  actinide  data  sets.  In  these  data  sets,  many  of  the  classes  contain  only  one 
sample.  For  the  classes  in  which  this  is  true,  the  sample  is  duplicated  in  both  the  training 
and  test  subsets. 
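
The by-class split can be sketched as follows. This Python fragment is an illustration only and does not reproduce randchoose.m; the function name, the fixed random seed, and the rounding of the two-thirds cut are assumptions. A class with a single sample is placed in both subsets, as described above.

```python
import random

def split_by_class(samples, train_fraction=2.0 / 3.0, seed=0):
    """samples: list of (label, features). Returns (training, test) lists."""
    rng = random.Random(seed)
    by_class = {}
    for sample in samples:
        by_class.setdefault(sample[0], []).append(sample)
    training, test = [], []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)
        if len(members) == 1:
            # A single-sample class appears in both subsets, as the text describes.
            training += members
            test += members
            continue
        # Keep at least one sample of the class on each side of the cut.
        cut = max(1, min(len(members) - 1, round(len(members) * train_fraction)))
        training += members[:cut]
        test += members[cut:]
    return training, test

# Hypothetical data: nine samples of class 1, three of class 2, one of class 3.
samples = [(1, [k]) for k in range(9)] + [(2, [k]) for k in range(3)] + [(3, [0])]
training, test = split_by_class(samples)
print(len(training), len(test))
```

Because a duplicated single-sample class is seen during training, the test error on such a class does not measure generalization, which is worth keeping in mind when reading the per-class results.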

In order to ensure that the results obtained by this method are insensitive to the initial values of the weight matrices, each data set is trained ten times, beginning each training run with different initial weight matrices (randomly chosen values as discussed in Section 2.4.2). The Steel classifier is trained for 500 epochs on each run, while the Actinide1 and the Actinide2 classifiers are trained for 20,000 epochs on each run. The mean of the training curves and the mean of the test curves for each data set are shown in Figures 3.5-3.7.
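
Selecting the number of training epochs from the averaged test curve can be sketched in Python as below. This is an illustration and not the tev.m function; the function names and the two short error traces are hypothetical, and taking the first minimum of the mean test curve is one simple reading of "where the test set error begins to increase".

```python
def mean_curve(runs):
    """Average several error curves (one per training run) epoch by epoch."""
    return [sum(errors) / len(errors) for errors in zip(*runs)]

def best_epoch(test_curve):
    """1-based epoch at which the test set error is lowest (first occurrence)."""
    lowest = min(test_curve)
    return test_curve.index(lowest) + 1

# Hypothetical test set error traces from two training runs.
test_runs = [
    [0.30, 0.20, 0.12, 0.14, 0.18],
    [0.32, 0.18, 0.10, 0.12, 0.20],
]
average = mean_curve(test_runs)
stop_at = best_epoch(average)
print(average, stop_at)
```

The individual runs may reach their minima at different epochs, so the epoch chosen from the mean curve is a compromise across initializations rather than the optimum of any single run.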

3.7.1 Steel Results. For the Steel data, the test set error reaches a minimum of 11% at epoch 65, at which point it begins to increase while the training set error is still declining. Therefore, the number of epochs for training the Steel data is selected as 65. It should be noted that this value is the point at which the average curve is at a minimum. The individual training curves reach minimum error at different points in training. The range of epochs at which the minimum error is reached for the different learning curves is [25, 180].

Figure 3.5 Steel: training and test set error traces over 500 epochs.


The  confusion  matrices  after  65  training  epochs  for  both  the  training  and  the  test  sets 
are  computed  (Tables  3.5  and  3.6).  Each  row  in  these  matrices  indicates  the  actual  class 
of  samples  in  the  data  set,  while  columns  represent  the  class  assigned  to  the  samples  by 
the  classifier.  A  value  in  the  table  at  row  i  and  column  j  indicates  the  number  of  samples 
of  class  i  which  are  classified  as  class  j  by  the  classifier.  Therefore,  values  on  the  main 
diagonal  represent  the  number  of  samples  correctly  classified  in  each  class.  For  example, 
in  Table  3.5,  148  samples  of  class  one  are  classified  as  class  one,  while  six  are  misclassified 
as  class  two  and  two  are  misclassified  as  class  three. 
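
A confusion matrix of this form can be computed with a short Python sketch. The code is an illustration, not the thesis software, and the function names are hypothetical. The labels are constructed to reproduce only the class-one row discussed above (148, 6, and 2 samples); the other rows are left empty in the example.

```python
def confusion_matrix(actual, assigned, num_classes):
    """matrix[i][j] = number of class i+1 samples that were assigned to class j+1."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for a, b in zip(actual, assigned):
        matrix[a - 1][b - 1] += 1
    return matrix

def percent_correct(matrix):
    """Per-class accuracy: diagonal entry divided by the row total, in percent."""
    return [100.0 * row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]

# 156 class-one samples: 148 assigned to class one, 6 to class two, 2 to class three.
actual = [1] * 156
assigned = [1] * 148 + [2] * 6 + [3] * 2
matrix = confusion_matrix(actual, assigned, 3)
rates = percent_correct(matrix)
print(matrix[0], round(rates[0]))
```

Dividing the diagonal entry by the row total gives the per-class percentage, so 148 of 156 class-one samples corresponds to about 95% correct.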


Actual Class        Assigned Class         % Correct
                    1       2       3
1                   148     6       2      95
2                   4       78      0      95
3                   0       0       76     100

Table 3.5 Steel: confusion matrix for the training set.

Table 3.6 Steel: confusion matrix for the test set.

The  purpose  of  the  confusion  matrices  is  to  illustrate  the  accuracy  by  class.  For  a  data 
set  in  which  the  samples  are  not  evenly  distributed  among  the  classes,  the  classifier  may 
sacrifice  accuracy  in  the  classes  which  contain  few  samples,  while  achieving  high  accuracy 
in  classes  which  contain  many  samples  and  maximizing  the  overall  accuracy.  The  Steel 
samples  are  evenly  distributed  over  the  classes  and  the  accuracy  is  consistent  across  the 
classes. 

3.7.2 Actinide1 Results (55-class). The Actinide1 training and test set error curves (Figure 3.6) asymptotically approach 57% and 59%, respectively. At 20,000 epochs, the error curves are still on a downward trend but have leveled off significantly. As a result, these error rates are considered to be indicative of the classifier performance on this data set.

The poor performance of the multi-layer perceptron classifier on this data set is due to the large number of classes and the small number of samples per class. Although the confusion matrices are too large to display (55x55), Tables 3.7 and 3.8 show the number of samples per class and the percentage of samples misclassified within each class. It is evident that the more heavily populated classes are classified more accurately and that, in general, classes containing fewer than ten samples have zero accuracy. According to Foley, if the ratio of the number of samples per class to the number of features is greater than three, the training set error is approximately the test set error and they both approach Bayes error rate [10]. For the Actinide1 data set, there are 18 features and the average number of samples per class is 21, with only six of the 55 classes containing more than 54 samples (the number of samples per class required to meet the Foley ratio requirement for the given data). Therefore, it is doubtful that further training of the network will improve either the training set error or the test set error. As a result, further analysis of this data, with the exception of Bayes error bounding, is abandoned in favor of the 15-class Actinide2 data. Bayes error bounding is still performed on the Actinide1 data set (Section 3.9) to confirm the assertion that additional training will not significantly improve the performance of the classifier using this data.
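
The arithmetic behind the Foley ratio check is shown in the Python sketch below. It is an illustration only; the function names are hypothetical, and the class sizes in the dictionary are invented for the example, since the actual samples-per-class values are those of Tables 3.7 and 3.8.

```python
def foley_minimum_samples(num_features, ratio=3):
    """Samples per class needed for a samples-per-class to features ratio of `ratio`."""
    return ratio * num_features

def classes_meeting_ratio(samples_per_class, num_features, ratio=3):
    """Labels of the classes whose sample count exceeds ratio * num_features."""
    needed = foley_minimum_samples(num_features, ratio)
    return [label for label, count in samples_per_class.items() if count > needed]

required = foley_minimum_samples(18)   # 54 for the 18 actinide features
average_per_class = 1151 / 55          # about 21 samples per class
# Hypothetical class sizes, for illustration only.
sizes = {"A-1": 120, "B-1": 60, "C-1": 9, "D-1": 1}
well_populated = classes_meeting_ratio(sizes, 18)
print(required, round(average_per_class), well_populated)
```

With 1151 samples spread over 55 classes, the average class holds fewer than half of the 54 samples that the ratio calls for.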


Figure 3.6 Actinide1 (55-class): training and test set error traces plotted over 20,000 epochs.

Table 3.7 Actinide1 (55-class): training set accuracy by class.

3.7.3 Actinide2 Results (15-class). Unlike the Actinide1 learning curves, the training and test error traces (Figure 3.7) for the Actinide2 data drop to 21% and 26%, respectively, within the first 3,000 epochs, but each drops only an additional .1% over the next 17,000 epochs. Although the minimum error in the test error curve (24%) occurs at epoch 15,399, training for this number of epochs greatly increases the computation time on a data set of this size (1151 samples) without a significant decrease in the test set error. To avoid this increase in computation time, the number of training epochs is selected as 3,000.


Figure 3.7 Actinide2 (15-class): training and test set error traces over 20,000 epochs.

The  confusion  matrices  for  the  Actinide2  data  set  are  shown  in  Tables  3.9  and  3.10. 
The  accuracy  is  clearly  lower  in  the  classes  which  contain  fewer  samples.  Classes  12  and 
13,  which  contain  the  majority  of  the  samples  in  the  entire  data  set,  dominate  the  training 
and,  as  a  result,  the  classifier  performs  well  on  these  classes  in  both  the  training  and  test 
sets.  In  addition,  the  next  most  populated  class,  class  5,  is  the  only  other  class  with  an 
accuracy  greater  than  zero.  This  suggests  that  training  may  be  more  successful  on  the 
remaining classes if they are more heavily populated.

Actual Class

Assigned  Class 

%  Correct 

1 

- 1 

2 

- 1 

3 

-  1 

1 

4 

5 

6 

7 

8 

9 

10 

1 

11 

12 

1 

13 

14 

15 

1 

0 

0 

0 

0 

3 

0 

0 

0 

0 

0 

4 

17 

0 

0 

0 

2 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

D 

0 

0 

0 

0 

3 

0 

0 

0 

0 

_ 1 

0 

0 

0 

0 

0 

-1 

2 

2 

0 

0 

0 

4 

0 

3 

0 

0 

0 

0 

0 

1 

0 

0 

0 

5 

0 

0 

« 

3 

23 

0 

0 

46 

6 

0 

0 

Al 

0 

_ 1 

0 

0 

0 

0 

0 

0 

_ 

0 

1 

0 

0 

0 

7 

0 

0 

2 

0 

0 

0 

0 

0 

2 

2 

0 

0 

8 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

2 

0 

0 

0 

9 

0 

0 

0 

0 

0 

0 

0 

0 

0 

1 

0 

0 

0 

10 

0 

0 

0 

0 

_ 

0 

0 

0 

2 

6 

0 

0 

0 

11 

- 1 

0 

0 

0 

0 

0 

0 

2 

0 

_ 

0 

0 

0 

12 

0 

0 

0 

0 

0 

13 

0 

0 

0 

0 

0 

89 

14 

0 

8 

0 

_ 

D 

0 

15 

0 

0 

0 

0 

0 

0 

0 

1 

0 

0 

0 

Table 3.9 Actinide2 (15-class): confusion matrix for the training set.

Actual                              Assigned Class                               %
Class     1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   Correct
1         0    0    0    0    0    0    0    0    0    0    0    1   11    0    0    0
2         0    0    0    0    0    0    0    0    0    0    0    1    0    0    0    0
3         0    0    0    0    0    0    0    0    0    0    0    1    2    0    0    0
4         0    0    0    0    1    0    0    0    0    0    0    1    0    0    0    0
5         0    0    0    0    9    0    0    0    0    0    0    0   17    0    0    35
6         0    0    0    0    0    0    0    0    0    0    0    0    1    0    0    0
7         0    0    0    0    0    0    0    0    0    0    0    2    1    0    0    0
8         0    0    0    0    0    0    0    0    0    0    0    1    2    0    0    0
9         0    0    0    0    0    0    0    0    0    0    0    0    1    0    0    0
10        0    0    0    0    0    0    0    0    0    0    0    1    3    0    0    0
11        0    0    0    0    0    0    0    0    0    0    0    0    1    0    0    0
12        0    0    0    0    0    0    0    0    0    0    0   94   18    0    0    87
13        0    0    0    0    7    0    0    0    0    0    0   28  183    0    0    84
14        0    0    0    0    1    0    0    0    0    0    0    1    2    0    0    0
15        (entries not legible in the source)

Table 3.10 Actinide2 (15-class): confusion matrix for the test set.


3.7.4  Additional Analysis.  Two other neural network algorithms, fuzzyARTmap [5] and the
Multi-layer perceptron Iterative Construction Algorithm (MICA, developed at AFIT) [22],
were also used to classify the actinide data sets. The fuzzyARTmap architecture, which is
particularly well suited for data with only a few samples per class, performs no better
than the multi-layer perceptron on either the 55-class Actinide1 data or the 15-class
Actinide2 data. Conversely, the MICA network, which adds hidden layer nodes until it is
able to separate the training data, classifies the Actinide2 training set with 100%
accuracy, but it requires 851 hidden layer nodes to do so and the test error is 92%. This
indicates that the feature set for this data is not sufficient to separate the data and
maintain generalization; good features are required to produce good classifiers. Although
the 55-class Actinide1 classification using MICA was not completed due to computation
time, the error rates would be expected to be higher because of the larger number of
classes. From this additional analysis, it is concluded that the poor performance of the
multi-layer perceptron on the data is due not only to the large number of classes and the
small number of samples per class but also to overlap in the feature space.

3.8  Feature Reduction & Classifier Retraining

Once initial training is complete, the features for each data set are ranked according
to their saliency, the least salient features in each data set are removed, and the
classifiers are retrained using the reduced data sets. As discussed in Section 3.7.2, the
Actinide1 data set is not considered in this section.

The general process in this portion of the research is to rank order the features by
forward sequential selection and plot the error associated with the addition of each
feature to the feature nucleus (MATLAB® function fselct.m, Section B.9). As each additional
feature is added to the nucleus, the classification error generally decreases until, at
some point, a minimum error is reached. The number of features, f, at this point is the
number of features which should be used to classify the given data set. Given this, the
first f features of the nucleus are considered to be the salient features and all other
features are removed from the data set under consideration. Finally, a classifier, using
the parameters determined previously, is trained on the reduced data set.
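
The selection loop described above can be sketched as follows. This is a minimal illustration in Python, not the fselct.m routine of Section B.9; the function names and the toy error function are assumptions made for the example. In the research itself, the error of a candidate feature subset comes from training a multi-layer perceptron on that subset.

```python
from typing import Callable, List, Sequence, Tuple


def forward_sequential_selection(
    n_features: int,
    error_fn: Callable[[Sequence[int]], float],
) -> Tuple[List[int], List[float]]:
    """Greedily rank features; return (ranking, error after each addition)."""
    nucleus: List[int] = []
    errors: List[float] = []
    remaining = list(range(n_features))
    while remaining:
        # Try every unused feature and keep the one giving the lowest error.
        best = min(remaining, key=lambda f: error_fn(nucleus + [f]))
        nucleus.append(best)
        errors.append(error_fn(nucleus))
        remaining.remove(best)
    return nucleus, errors


def salient_features(ranking: List[int], errors: List[float]) -> List[int]:
    """Keep the first f features, where f is the position of the minimum error."""
    f = errors.index(min(errors)) + 1
    return ranking[:f]


def toy_error(subset: Sequence[int]) -> float:
    """Hypothetical stand-in for classifier error: features 2 and 0 help,
    every other feature acts as noise and raises the error slightly."""
    gain = {2: 0.30, 0: 0.15}
    return round(0.50 - sum(gain.get(f, -0.02) for f in subset), 4)


ranking, errors = forward_sequential_selection(4, toy_error)
selected = salient_features(ranking, errors)
```

With the toy error function, the ranking places the two helpful features first and the error curve reaches its minimum after the second addition, so only those two features are kept.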

3.8.1  Steel Results.  The classifiers in the forward sequential search are trained
for 65 epochs with a learning rate and a momentum of .1. The resulting ranking of features
is shown in Table 3.11. Figure 3.8 shows the classification error as features are added to
the nucleus. Labels on the x-axis correspond to the ranks in Table 3.11. For example, after
the first pass through all of the features, the addition of Ni to the nucleus (which is
initially empty) results in the best classification error (24.1%). So, Ni is added to the
nucleus. On the second pass, the addition of Mo to the nucleus (which now contains Ni)
results in the best classification error (6.8%). The minimum error (3.8%) is achieved with
the addition of feature 34 to the nucleus. Therefore, the feature set is reduced to include
only the top 34 features out of the original 50 features.

After selecting an appropriate feature set, the classifier is retrained on the reduced
feature set. Although the classifier could be satisfactorily trained in 65 epochs, it is
trained for 500 epochs in order to compare the results to Figure 3.5. Making this
comparison, it is clear that reducing the feature set does not diminish the performance of
the classifier. On the contrary, reducing the feature set improves the training and test
set error (at epoch 500) from 4% to .6% and 13% to 5.5%, respectively.

Rank   Symbol      Rank   Symbol
  1    Ni           26    K
  2    Mo           27    Co
  3    U234P        28    Cu
  4    Mg           29    Ti
  5    Cr           30    Mn
  6    Pu240P       31    Fe
  7    U236E        32    Ag
  8    S            33    Zn
  9    U235P        34    Sb
 10    Ca           35    Ru
 11    Zr           36    Nd
 12    U236P        37    V
 13    U234E        38    Pb
 14    Pu241E       39    Cd
 15    U235E        40    Sm
 16    Pu240E       41    Th
 17    Na           42    U
 18    F            43    Pu
 19    Si           44    P
 20    Pu242P       45    Gd
 21    Al           46    W
 22    C            47    Bi
 23    Cl           48    OE
 24    O            49    CrE
 25    Rh           50    FeE

Table  3.11  Steel:  features  ranked  by  saliency. 


Apparent abnormalities in the feature ranking, such as the error in a measurement
ranking higher than the measurement itself, may be resolved by running feature selection
multiple times and determining the relative frequency of each feature's ranks. Doing this,
Ruck et al. showed, for a given data set, that the rankings of the most salient and the
least salient features were more consistent than those of features in the middle of the
rankings [27]. The rankings of features in the middle varied widely over feature selection
runs. For example, the top five and the bottom five Steel features may rank consistently in
the top and bottom, respectively, over multiple runs. However, the ranks of the middle
features (feature 6 to feature 45 in Table 3.11) may vary widely over multiple runs.
Therefore, it may be insignificant that U236E outranks U236P for the one feature selection
run shown. Multiple runs could be used to verify the rankings.
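
Tallying ranks over several selection runs might look like the sketch below. It is only an illustration of the bookkeeping: the four rankings are invented for the example and are not results from this research, and the function name is an assumption.

```python
from collections import Counter, defaultdict


def rank_frequencies(runs):
    """Relative frequency with which each feature lands at each rank."""
    counts = defaultdict(Counter)
    for ranking in runs:
        for rank, feature in enumerate(ranking, start=1):
            counts[feature][rank] += 1
    n_runs = len(runs)
    return {
        feature: {rank: c / n_runs for rank, c in sorted(ranks.items())}
        for feature, ranks in counts.items()
    }


# Hypothetical rankings from four feature selection runs.
runs = [
    ["Ni", "Mo", "Cr", "Fe"],
    ["Ni", "Cr", "Mo", "Fe"],
    ["Ni", "Mo", "Cr", "Fe"],
    ["Ni", "Cr", "Mo", "Fe"],
]
freq = rank_frequencies(runs)
```

In this invented example the first and last features hold their ranks in every run, while the two middle features trade places half of the time, which is the pattern the paragraph above describes.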


Figure 3.8  Steel: classification error versus feature added to the nucleus during forward
sequential selection.

Figure 3.9  Steel: training and test set error traces over 500 epochs.


The confusion matrices are shown for one training run on the reduced feature set
(Tables 3.12 and 3.13). Using the reduced feature set, the classifier shows marked
improvement on all classes of both the training and the test sets.


Actual         Assigned Class          %
Class        1      2      3       Correct
  1        155   [1 sample misassigned;      99
                  column illegible in source]
  2          0     82      0         100
  3          0      0     76         100

Table  3.12  Steel  (reduced  feature  set):  confusion  matrix  for  the  training  set. 

Actual         Assigned Class          %
Class        1      2      3       Correct
  1         71      2      5          91
  2          2     40      0          95
  3          1      0     38          97

Table 3.13  Steel (reduced feature set): confusion matrix for the test set.

Overall  the  feature  selection  is  successful.  The  feature  set  is  reduced  from  50  features 
to  34  features  with  no  decrease  in  the  overall  accuracy  of  the  classifier.  In  fact,  on  this  data 
set,  the  reduced  feature  set  provides  better  classification  than  the  full  feature  set.  This 
improvement  in  accuracy  results  from  omission  of  features  which,  in  terms  of  classification, 
are  analogous  to  noise.  As  with  most  systems,  noise  results  in  diminished  performance. 
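
The "% Correct" column and the overall error follow directly from the counts in a confusion matrix. The sketch below recomputes them from the counts reported in Table 3.13; the function names are assumptions. Note that the matrix comes from a single training run, so its overall error need not equal the ten-run average of 5.5% quoted in the text.

```python
def percent_correct(matrix):
    """Per-class accuracy in percent: diagonal count over the row total."""
    return [round(100.0 * row[i] / sum(row)) for i, row in enumerate(matrix)]


def overall_error(matrix):
    """Overall error in percent: off-diagonal counts over all samples."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 100.0 * (total - correct) / total


# Rows are actual classes, columns are assigned classes (Table 3.13).
table_3_13 = [
    [71, 2, 5],
    [2, 40, 0],
    [1, 0, 38],
]
by_class = percent_correct(table_3_13)
error = overall_error(table_3_13)
```

The per-class values reproduce the 91%, 95% and 97% listed in the table, and ten of the 159 test samples are misclassified in this run.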

3.8.2  Actinide2  Results  (15-class).  The  classifiers  in  the  forward  sequential 
selection  on  the  Actinide2  data  are  trained  for  3,000  epochs.  Table  3.14  shows  the  features 
ranked  by  saliency.  Again,  plotting  the  features  against  the  classification  error,  as  shown 
in  Figure  3.10,  indicates  an  appropriate  number  of  features.  In  this  case,  the  ninth  feature 
added  to  the  nucleus  results  in  the  lowest  overall  classification  error.  Therefore,  the  feature 
set  is  reduced  to  the  first  nine  features  listed  in  Table  3.14. 

The  classifier  is  retrained  on  the  reduced  feature  set.  The  training  and  test  set 
learning  curves  are  shown  in  Figure  3.11.  The  overall  error  of  both  the  training  set  (24% 
error)  and  test  set  (26%)  is  higher  than  the  associated  error  when  using  the  full  feature 
set.  However,  the  ultimate  goal  of  feature  selection/reduction  is  to  reduce  the  feature  set 
to  reduce  training  times  without  significantly  reducing  the  accuracy  of  the  classifier.  If  the 
reduced  feature  set  error  is  greater  than  that  of  the  full  feature  set  error,  but  both  are  less 
than  the  upper  bound  on  Bayes  error,  the  difference  in  the  accuracy  using  the  two  feature 
sets  may  be  considered  insignificant.  This  comparison  is  made  in  Section  3.9  where  the 
Bayes  error  bounding  procedure  and  results  are  presented. 

Rank   Symbol      Rank   Symbol
  1    U236P        10    Pu240E
  2    U234PE       11    U235E
  3    Pu242P       12    Pu241E
  4    Pu240PE      13    U234P
  5    Pu240P       14    U236PE
  6    U236E        15    U235P
  7    Pu242E       16    Pu241PE
  8    Pu242PE      17    U235PE
  9    Pu241P       18    U234E

Table 3.14  Actinide2 (15-class): features ranked by saliency.

Figure 3.10  Actinide2 (15-class): classification error versus feature added to the nucleus
during forward sequential selection.

Figure 3.11  Actinide2 (15-class): training and test set error traces over 3,000 epochs.


The confusion matrices for the Actinide2 data set are shown in Tables 3.15 and 3.16.
The confusion matrices show that there are shifts in the by-class accuracies in both the
training and test sets. However, the shift in overall accuracy is less than 1% in both
sets. Although this seems to be contrary to the difference in the error of both sets
shown by Figures 3.7 and 3.11, it should be noted that the confusion matrices are
computed for one training run, while each curve in the figures represents an average of
ten classifier training runs. The confusion matrices show that, even with the reduced
feature set, the classifier still focuses on and performs well on the classes which are
more heavily populated.


Actual  Class 

Assigned  Class 

%  Correct 

1 

2 

3 

4 

5 

6 

7 

8 

9 
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31 

0 
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13 

0 

0 
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0 

44 

0 

89 

14 

0 

0 

0 

2 

4 

2 

o 

25 

15 
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0 
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0 

0 

1 

0 

0 

0 

Table  3.15  Actinide2  (reduced  feature  set):  confusion  matrix  for  the  training  set. 
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Actual  Class 

Assigned  Class 
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1 
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14 

0 

1 

3 
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15 

0 

1 

0 

0 

0 

Table  3.16  Actinide2  (reduced  feature  set):  confusion  matrix  for  the  test  set. 


3.9  Bayes Error Bounding & Classifier Evaluation

In  order  to  evaluate  the  performance  of  the  classifiers,  the  error  rates  achieved  are 
compared  to  the  Bayes  error  bounds  for  each  data  set.  Because  Bayes  error  rate  represents 
the  best  performance  that  any  classifier  can  achieve  on  average,  the  classifier  is  considered 
to  be  a  good  classifier  for  the  given  data  set  if  its  test  set  error  falls  within  the  Bayes  error 
bounds. 

In estimating the Bayes error bounds for the given data sets, only the upper bound
(determined by the Leave-One-Out Method) is calculated. The lower bound using the
multi-layer perceptron is of little use because it approaches zero as the number of hidden
nodes and the training times are increased [19]. Because Bayes error bounding is so
computationally intense, the upper bound is computed using one to ten hidden layer nodes
for each of the data sets.
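
The leave-one-out procedure itself is independent of the classifier used. The sketch below shows the hold-out loop with a nearest-neighbor rule swapped in as a stand-in classifier so that the example stays short; in this research the classifier inside the loop is a multi-layer perceptron. The function names and the seven data points are assumptions made for illustration.

```python
def nearest_neighbor_predict(train_x, train_y, x):
    """Stand-in classifier: return the label of the closest training sample."""
    dists = [sum((a - b) ** 2 for a, b in zip(t, x)) for t in train_x]
    return train_y[dists.index(min(dists))]


def leave_one_out_error(xs, ys, predict=nearest_neighbor_predict):
    """Hold out each sample once, fit on the rest, and count the misses."""
    misses = 0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        if predict(rest_x, rest_y, xs[i]) != ys[i]:
            misses += 1
    return misses / len(xs)


# Hypothetical one-feature, two-class data; the class 0 sample at 2.4
# lies closer to class 1 and is the only one misclassified when held out.
xs = [[0.0], [0.2], [0.4], [2.4], [3.0], [3.2], [3.4]]
ys = [0, 0, 0, 0, 1, 1, 1]
loo_error = leave_one_out_error(xs, ys)
```

Because every sample is held out exactly once, the cost grows with the number of samples times the cost of one training run, which is why the bound is computed for only a limited range of hidden layer nodes.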

3.9.1  Steel.  The  Bayes  error  bound  for  the  Steel  data  is  shown  in  Figure  3.12. 
Over  the  range  of  hidden  layer  nodes  from  3  to  6,  the  error  remains  at  approximately  10%. 
Both  the  training  set  error  (.60%)  and  the  test  set  error  (5.5%)  are  clearly  less  than  the 
10%  upper  bound.  Therefore,  the  multi-layer  perceptron  performs  as  well,  on  average,  as 
any  other  classifier,  statistical  or  neural,  for  the  Steel  data  set. 


Figure 3.12  Steel: Bayes error bounds.

3.9.2  Actinide1 (55-class).  The upper bound on Bayes error for the Actinide1
data set is shown in Figure 3.13. At five hidden layer nodes, the upper bound on Bayes
error is 63%. Comparing this to the training error (57%) and the test error (59%), the
classifier performance approaches Bayes optimality for the data set as given. This supports
the assertion made in Section 3.7.2 that further training will not significantly increase
the accuracy on this 55-class data structure. Therefore, the decision to drop the Actinide1
data set from consideration is justified.


Figure 3.13  Actinide1 (55-class): Bayes error bounds.

3.9.3  Actinide2 (15-class).  The Bayes error bound for the Actinide2 data set is
shown in Figure 3.14. At five hidden layer nodes, the upper bound on Bayes error is 29%.
Using the full feature set, the training and test set errors (21% and 26%, respectively)
are within this upper bound. Furthermore, the reduced feature set classification error also
approaches Bayes error, though it results in a slightly higher training set error (24%)
and the same test set error (26%). Because Bayes error represents the best that any
classifier can perform on average on a given data set, the features measured and the number
of samples per class in the Actinide2 data are sufficient to allow only 74% to 79% accuracy.

Figure 3.14  Actinide2 (15-class): Bayes error bounds.


3.10  Summary 

This chapter has presented the methods used to design multi-layer perceptron
classifiers for three data sets as well as all results obtained by these methods. A
classifier was designed for each data set by selecting the appropriate classifier
parameters, conducting initial training, reducing the feature set/retraining, and comparing
the performance of each to Bayes error. Table 3.17 shows the final results of this research
effort.


Data Set     # Features   Training Error   Test Error   Bayes Error
Steel            50            .6%            5.5%          11%
Actinide1        18            57%            59%           63%
Actinide2         9            24%            26%           29%

Table 3.17  Summary of results: the Bayes error value is the upper bound at five hidden
layer nodes.

The training and test set errors are the lowest average values of ten training runs with
each run beginning with different random weight matrices. The Bayes error values are the
upper bound estimates using five hidden layer nodes. The results are fully summarized
and conclusions are drawn in Chapter IV.

IV.  Conclusion


4.1  Introduction 

The primary objective of this research is to provide a methodology by which
environmental professionals may design neural networks to classify environmental samples.
The secondary goals are to design a multi-layer perceptron to classify two environmental
data sets which were provided by AFTAC, to determine the salient features for the given
data sets, and to evaluate the performance of each classifier produced.

4.2  Methodology Summary

The methodology implemented in this research may be summarized as follows:

1.  Select  the  architecture  parameters. 

2.  Select  the  training  parameters  for  the  given  data  using  Fallside  plots. 

3.  Train  the  classifier  tracking  the  test  set  error  and  select  an  appropriate  number  of 
training  epochs. 

4.  Perform  forward  sequential  selection  and  reduce  the  feature  set. 

5.  Train  the  classifier  again  using  the  reduced  feature  set. 

6.  Calculate  the  Bayes  error  bounds  and  evaluate  the  classifier  performance. 
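
Step 3 of the list above, choosing the number of training epochs from the tracked test set error, could be automated along the following lines. The selection rule, the function name, and the error trace are assumptions made for illustration; the research selects the epoch by inspecting the error curves.

```python
def select_epochs(test_errors, tolerance=0.0):
    """Return the earliest epoch (1-based) whose test set error is within
    `tolerance` of the lowest test set error seen over the whole run."""
    best = min(test_errors)
    for epoch, err in enumerate(test_errors, start=1):
        if err <= best + tolerance:
            return epoch


# Hypothetical test set error trace, one value per epoch.
trace = [0.40, 0.22, 0.15, 0.12, 0.11, 0.11, 0.12, 0.13]
epochs_at_minimum = select_epochs(trace)
epochs_with_slack = select_epochs(trace, tolerance=0.015)
```

A small tolerance trades a slightly higher test set error for a shorter training time, which matters when many classifiers must be trained during forward sequential selection.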

4.3  Summary of Results

4.3.1  Stainless Steel.  The stainless steel data provided by AFTAC consists of
196 input features. By removing the homogeneous features, the data is pared down to 50
features. Using Fallside plots, the learning rate and momentum for the backpropagation
algorithm used to train on this data are both chosen as .1. By tracking the test set error
while training with these parameters, an adequate training time is determined to be 63
epochs. The classifier error at this point is 6.8% on the training set and 11% on
the test set. Subsequently, the feature set is reduced, using forward sequential selection,
to 34 input features. Retraining on this reduced data set results in a final error on the
training and test sets of .6% and 5.5%, respectively. For evaluation purposes, the upper
bound on Bayes error is computed over ten hidden layer nodes. The upper bound at five
hidden layer nodes, which is the number of hidden layer nodes used on all classifiers in
this research, is 11%.

4.3.2  Actinide.  AFTAC provided a single actinide data set labeled with
alphanumeric descriptors representing the locations from which the samples were taken. By
allowing each descriptor to represent a class, 55 classes result. Because many of the
classes contain very few samples, the class structure is altered to allow only 15 larger
classes. Therefore, the two actinide data sets are identical with the exception of the
class structure. Each contains 18 input features.

4.3.2.1  55-class Actinide.  As with the steel data, the training parameters
are selected using Fallside plots. A learning rate of .1 and a momentum of .5 are chosen.
The data is divided into training and test sets and a classifier is trained using these
parameters. The training set error approaches 57%, while the test set error approaches 59%.
Because this level of error is unacceptable for real-world application and it is not clear
that further training will improve performance, no additional analysis is conducted with
the exception of bounding Bayes error. The upper bound on Bayes error for this data set
is 63% (at 5 hidden layer nodes).

4.3.2.2  15-class Actinide.  The training parameters for this data set are
a learning rate and momentum of .1. Training with these parameters on the full feature
set results in a training error of 21% and a test error of 26%. During forward sequential
selection, the classifiers are trained for 3,000 epochs each and the feature set is reduced
from 18 to 9 features. Upon retraining, the test set error is unchanged, while the training
set error increases to 24%. The Bayes error upper bound at five hidden layer nodes is
found to be 29%.


4.4  Conclusions

The following conclusions may be drawn from this thesis:

•  The multi-layer perceptron is capable of classifying different types of stainless steel
samples with a high degree of accuracy.

•  The multi-layer perceptron performs as well as any other classifier when classifying
actinide by location, given the data set provided. The overall accuracy of the
multi-layer perceptron is unacceptably low for both the 55-class and the 15-class
structures. However, the error rates are within the upper bound of Bayes error in both
cases. Furthermore, the poor performance is linked to the number of classes containing few
samples. On the classes containing an adequate number of samples, the multi-layer
perceptron performed well.

•  The number of features used in classifying the given environmental data can be
significantly reduced with little or no decrease in the classification accuracy.

Considering the results and conclusions presented in this chapter and the methodology
outlined throughout this work, this thesis has met all of the objectives set forth in
Section 1.4.

4.5  Recommendations for Follow-on Research

The  possibilities  for  additional  research  related  to  this  thesis  are: 

1.  Develop  training  algorithms  capable  of  minimizing  by-class  error  and  overall  error 
simultaneously. 

2.  For  the  actinide  data,  gather  more  samples  in  the  less  populated  classes  and  perform 
the  analysis  outlined  in  this  research. 

3.  Measure  alternative  features  for  the  actinide  data  and  perform  the  analysis  outlined. 

Appendix A.  Backpropagation Learning Law Derivations

A.1  General Learning Law Derivation

For the MLP to be trained to classify, a general learning law for each set of weights
can be derived independent of the type of transformation function applied at the nodes.
Backpropagation is the most common technique, and the one used here, to update the
matrix of weights, $W$. Backpropagation requires that the partial derivative of the error,
$E$, be computed with respect to each weight. A widely accepted measurement of the error
is the sum squared error, defined by:

$$E = \frac{1}{2} \sum_{k=1}^{K} (d_k - f_k)^2$$

where $d_k$ is the desired output and $f_k$ is the actual output. If
$\partial E / \partial W$ describes a matrix whose elements are given as:

$$\left( \frac{\partial E}{\partial W} \right)_{ij} = \frac{\partial E}{\partial w_{ij}}$$

then the learning law for each set of weights is generally written as:

$$W^{+} = W - \eta \frac{\partial E}{\partial W}$$

where $W^{+}$ is the updated weight set, $W$ is the old weight set and $\eta$ is the
learning rate.

In the next two sections, the weight update equations for a two-layer perceptron are
derived.


A.1.1  Derivation of Output-Layer Weight Updates.  The update equations for the
output layer are considered in this section, followed by a section containing the
derivation of the update equations for the hidden layer. An output-layer weight is
designated by subscripts $j$ and $k$, specifying the connection between the $j$th node in
the hidden layer and the $k$th node in the output layer. Each output from the output layer
is designated by $f_k$. From the generalized form of the learning law, the updated weight
is established as:

$$w^{+}_{j_0 k_0} = w_{j_0 k_0} - \eta \frac{\partial E}{\partial w_{j_0 k_0}}$$

where the error $E$ for the two-layer perceptron is defined as:

$$E = \frac{1}{2} \sum_{k=1}^{K} (d_k - f_k)^2$$

To implement the update equation, the partial derivative of the error, $E$, with respect
to the weight $w_{j_0 k_0}$ is evaluated:

$$\frac{\partial E}{\partial w_{j_0 k_0}} = \frac{\partial}{\partial w_{j_0 k_0}} \left\{ \frac{1}{2} \sum_{k=1}^{K} (d_k - f_k)^2 \right\}$$

In the partial derivative of the summation from the above equation, the summation's
dependency on $w_{j_0 k_0}$ must be isolated.

$$\frac{1}{2} \sum_{k=1}^{K} (d_k - f_k)^2 = \frac{1}{2} \left\{ (d_1 - f_1)^2 + \cdots + (d_{k_0} - f_{k_0})^2 + \cdots + (d_K - f_K)^2 \right\}$$

Taking the partial derivative of the equation above, the only part to survive the
differentiation will be variables that involve both subscripts $j_0$ and $k_0$. This
reasoning identifies the terms $f_{k_0}$ and $d_{k_0}$. While the desired output $d_{k_0}$
is a constant, the actual output $f_{k_0}$ is a function of the summation of weighted
outputs from the hidden layer. The partial derivative of the summation simplifies to:

$$\frac{\partial E}{\partial w_{j_0 k_0}} = -(d_{k_0} - f_{k_0}) \frac{\partial f_{k_0}}{\partial w_{j_0 k_0}}$$

and therefore:

$$\frac{\partial E}{\partial w_{j_0 k_0}} = -(d_{k_0} - f_{k_0}) \frac{\partial}{\partial w_{j_0 k_0}} f_{k_0}\!\left( \sum_{j=1}^{J+1} w_{j k_0} f_j \right)$$

where $f_{k_0}(a)$ is the transformation employed at node $k_0$. By substitution:

$$\frac{\partial E}{\partial w_{j_0 k_0}} = -(d_{k_0} - f_{k_0}) f'_{k_0} \frac{\partial}{\partial w_{j_0 k_0}} \sum_{j=1}^{J+1} w_{j k_0} f_j = -(d_{k_0} - f_{k_0}) f'_{k_0} f_{j_0}$$

The notation $f'_{k_0}$ represents the derivative of $f_{k_0}(a)$ with respect to $a$.
After the differentiation, the argument $a$ is replaced by the activation function
$\sum_{j=1}^{J+1} w_{j k_0} f_j$.

Therefore, the output-layer weight update equation reduces to:

$$w^{+}_{j_0 k_0} = w_{j_0 k_0} + \eta (d_{k_0} - f_{k_0}) f'_{k_0} f_{j_0} \tag{A.2}$$


A.1.2  Derivation of Hidden-Layer Weight Updates.  The hidden-layer weight is
designated by subscripts $i$ and $j$, specifying the connection between the $i$th node in
the input layer and the $j$th node in the hidden layer. Each output from the hidden layer
is designated by $f_j$. Similar to the updated weight for the output layer, the
hidden-layer updated weight is established as:

$$w^{+}_{i_0 j_0} = w_{i_0 j_0} - \eta \frac{\partial E}{\partial w_{i_0 j_0}}$$

The error term here is identical to that discussed in the last section. To implement
the update equation, the partial derivative of the error $E$ must be evaluated with respect
to the old weight $w_{i_0 j_0}$ for this layer rather than $w_{j_0 k_0}$:

$$\frac{\partial E}{\partial w_{i_0 j_0}} = \frac{\partial}{\partial w_{i_0 j_0}} \left\{ \frac{1}{2} \sum_{k=1}^{K} (d_k - f_k)^2 \right\}$$

Because the partial derivative depends only on the links containing the $i_0$ and $j_0$
nodes and because $f_k$ is a function of the summation of weighted outputs $f_j$ from the
hidden layer, the partial derivative of the error term is given as:

$$\frac{\partial E}{\partial w_{i_0 j_0}} = -\sum_{k=1}^{K} (d_k - f_k) \frac{\partial f_k}{\partial w_{i_0 j_0}}$$

Substituting this along with the equation for $f_k$ from the previous section, the
following equation for the partial derivative of the error term may be derived:

$$\frac{\partial E}{\partial w_{i_0 j_0}} = -\sum_{k=1}^{K} (d_k - f_k) f'_k \, w_{j_0 k} \frac{\partial f_{j_0}}{\partial w_{i_0 j_0}}$$

where $f'_k$ represents the derivative of $f_k(a)$ with respect to $a$.

The output of a hidden-layer node is given by $f_{j_0}(a)$ where

$$a = \sum_{i=1}^{I+1} w_{i j_0} x_i$$

Substituting the above equation for $f_{j_0}$ and realizing that

$$\frac{\partial}{\partial w_{i_0 j_0}} \sum_{i=1}^{I+1} w_{i j_0} x_i = x_{i_0}$$

the partial derivative of the error term becomes:

$$\frac{\partial E}{\partial w_{i_0 j_0}} = -\sum_{k=1}^{K} (d_k - f_k) f'_k \, w_{j_0 k} \, f'_{j_0} \, x_{i_0}$$

Therefore, the hidden-layer weight update equation reduces to:

$$w^{+}_{i_0 j_0} = w_{i_0 j_0} + \eta \sum_{k=1}^{K} (d_k - f_k) f'_k \, w_{j_0 k} \, f'_{j_0} \, x_{i_0} \tag{A.3}$$

A.1.3  Conclusion.  In sum, the general equations for updating the output-layer
and hidden-layer weights are provided in Equations A.2 and A.3. In the next section, the
weight update equations will be derived for MLP structures tailored for specific
transformation functions.


A.2  Transformation Function Specific Derivation of Learning Laws

A.2.1  Transformation Functions.  In the following section, learning laws for both
layers will be derived for the output defined for the sigmoid, tanh, and linear
transformations:

•  SIGMOID:  $f_{k_0} = \dfrac{1}{1 + e^{-\sum_{j=1}^{J+1} w_{j k_0} f_j}}$;

•  TANH:  $f_{k_0} = \tanh\left( \sum_{j=1}^{J+1} w_{j k_0} f_j \right)$;

•  LINEAR:  $f_{k_0} = \sum_{j=1}^{J+1} w_{j k_0} f_j$.

The combination of transformations to be considered at the output and hidden layers
is shown in Table A.1.

Table A.1  Transformation Function Case Table

Layer      Case I      Case II     Case III    Case IV
Hidden     Sigmoid     Sigmoid     Tanh        Tanh
Output     Sigmoid     Linear      Tanh        Linear
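
For each of the three transformations, the derivative needed in the output-layer update can be written in terms of the node output itself: f(1 - f) for the sigmoid, 1 - f^2 for tanh, and 1 for the linear case. The sketch below applies the general output-layer update with the derivative chosen by name. It is an illustration only; the dictionary, the function name, and the numeric values are assumptions.

```python
# Derivative of each transformation, expressed through the node output f.
DERIVATIVE = {
    "sigmoid": lambda f: f * (1.0 - f),
    "tanh": lambda f: 1.0 - f * f,
    "linear": lambda f: 1.0,
}


def output_weight_update(w, eta, d_k, f_k, f_j, kind):
    """One output-layer update: w + eta * (d_k - f_k) * f'(f_k) * f_j."""
    return w + eta * (d_k - f_k) * DERIVATIVE[kind](f_k) * f_j


# Hypothetical values: old weight 0.5, learning rate 0.1, desired output 1.0,
# actual output 0.8, hidden-node output 0.5.
updated = {
    kind: output_weight_update(0.5, 0.1, 1.0, 0.8, 0.5, kind)
    for kind in DERIVATIVE
}
```

Only the derivative factor differs between the cases, so the same update routine serves all of them once the transformation at the output layer is known.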

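A.2.2  Case I: Sigmoid-Sigmoid.  For the output layer,

$$w^{+}_{j_0 k_0} = w_{j_0 k_0} - \eta \frac{\partial E}{\partial w_{j_0 k_0}}$$

Now, just analyzing the partial derivative term in the expression above yields the
following:

$$\frac{\partial E}{\partial w_{j_0 k_0}} = \frac{\partial}{\partial w_{j_0 k_0}} \left\{ \frac{1}{2} \sum_{k=1}^{K} (d_k - f_k)^2 \right\} = -(d_{k_0} - f_{k_0}) \frac{\partial f_{k_0}}{\partial w_{j_0 k_0}}$$

and substituting the sigmoidal transformation function for $f_{k_0}$ in the derivative
yields:

$$\frac{\partial E}{\partial w_{j_0 k_0}} = -(d_{k_0} - f_{k_0}) \frac{\partial}{\partial w_{j_0 k_0}} \left( 1 + e^{-\sum_{j=1}^{J+1} w_{j k_0} f_j} \right)^{-1}$$

$$\frac{\partial E}{\partial w_{j_0 k_0}} = -(d_{k_0} - f_{k_0}) \frac{e^{-\sum_{j=1}^{J+1} w_{j k_0} f_j}}{\left( 1 + e^{-\sum_{j=1}^{J+1} w_{j k_0} f_j} \right)^{2}} (f_{j_0})$$

$$\frac{\partial E}{\partial w_{j_0 k_0}} = -(d_{k_0} - f_{k_0})(f_{k_0})(1 - f_{k_0})(f_{j_0})$$

Therefore, for the output layer of the Sigmoid-Sigmoid Case:

$$w^{+}_{j_0 k_0} = w_{j_0 k_0} + \eta (d_{k_0} - f_{k_0})(f_{k_0})(1 - f_{k_0})(f_{j_0}) \tag{A.4}$$

Now, the hidden layer weights must be updated.

$$w^{+}_{i_0 j_0} = w_{i_0 j_0} - \eta \frac{\partial E}{\partial w_{i_0 j_0}}$$

The partial derivative term of the above expression will be analyzed as before.

$$\frac{\partial E}{\partial w_{i_0 j_0}} = \frac{\partial}{\partial w_{i_0 j_0}} \left\{ \frac{1}{2} \sum_{k=1}^{K} (d_k - f_k)^2 \right\} = -\sum_{k=1}^{K} (d_k - f_k) \frac{\partial f_k}{\partial w_{i_0 j_0}}$$

Substituting the sigmoidal transformation function for $f_k$ in the derivative yields:

$$\frac{\partial E}{\partial w_{i_0 j_0}} = -\sum_{k=1}^{K} (d_k - f_k) \frac{\partial}{\partial w_{i_0 j_0}} \left( 1 + e^{-\sum_{j=1}^{J+1} w_{j k} f_j} \right)^{-1}$$

$$\frac{\partial E}{\partial w_{i_0 j_0}} = -\sum_{k=1}^{K} (d_k - f_k)(f_k)(1 - f_k) \frac{\partial}{\partial w_{i_0 j_0}} \sum_{j=1}^{J+1} w_{j k} f_j$$

$$\frac{\partial E}{\partial w_{i_0 j_0}} = -\sum_{k=1}^{K} (d_k - f_k)(f_k)(1 - f_k)(w_{j_0 k}) \frac{\partial f_{j_0}}{\partial w_{i_0 j_0}}$$

Substituting the sigmoidal transformation function for $f_{j_0}$ in the derivative and
evaluating it yields:

$$\frac{\partial E}{\partial w_{i_0 j_0}} = -\sum_{k=1}^{K} (d_k - f_k)(f_k)(1 - f_k)(w_{j_0 k})(f_{j_0})(1 - f_{j_0})(x_{i_0})$$

Therefore, for the hidden layer of the Sigmoid-Sigmoid Case:

$$w^{+}_{i_0 j_0} = w_{i_0 j_0} + \eta \sum_{k=1}^{K} (d_k - f_k)(f_k)(1 - f_k)(w_{j_0 k})(f_{j_0})(1 - f_{j_0})(x_{i_0}) \tag{A.5}$$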
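
The sigmoid-sigmoid gradients can be checked numerically by comparing them with central-difference estimates of the sum squared error. The sketch below does this for a small network. The network size, weights, inputs, and the treatment of the bias as a constant extra node are assumptions made for the example; it is not the code used in this research.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, w_hid, w_out):
    """Return hidden outputs f_j (bias node appended) and outputs f_k."""
    f_j = [sigmoid(sum(w_hid[i][j] * x[i] for i in range(len(x))))
           for j in range(len(w_hid[0]))]
    f_j.append(1.0)  # assumed bias node, the (J+1)th hidden output
    f_k = [sigmoid(sum(w_out[j][k] * f_j[j] for j in range(len(f_j))))
           for k in range(len(w_out[0]))]
    return f_j, f_k


def error(x, d, w_hid, w_out):
    """Sum squared error E = 1/2 * sum_k (d_k - f_k)^2."""
    _, f_k = forward(x, w_hid, w_out)
    return 0.5 * sum((dk - fk) ** 2 for dk, fk in zip(d, f_k))


def analytic_gradients(x, d, w_hid, w_out):
    """dE/dw for both layers in the sigmoid-sigmoid case."""
    f_j, f_k = forward(x, w_hid, w_out)
    delta = [(d[k] - f_k[k]) * f_k[k] * (1.0 - f_k[k]) for k in range(len(d))]
    # Output layer: dE/dw_jk = -(d_k - f_k) f_k (1 - f_k) f_j
    g_out = [[-delta[k] * f_j[j] for k in range(len(d))]
             for j in range(len(f_j))]
    # Hidden layer: dE/dw_ij = -sum_k (d_k - f_k) f_k (1 - f_k) w_jk f_j (1 - f_j) x_i
    g_hid = [[-sum(delta[k] * w_out[j][k] for k in range(len(d)))
              * f_j[j] * (1.0 - f_j[j]) * x[i]
              for j in range(len(w_hid[0]))]
             for i in range(len(x))]
    return g_hid, g_out


def numeric_gradient(x, d, w_hid, w_out, weights, r, c, eps=1e-6):
    """Central-difference estimate of dE/dw for weights[r][c]."""
    old = weights[r][c]
    weights[r][c] = old + eps
    e_plus = error(x, d, w_hid, w_out)
    weights[r][c] = old - eps
    e_minus = error(x, d, w_hid, w_out)
    weights[r][c] = old
    return (e_plus - e_minus) / (2.0 * eps)


# Hypothetical network: 2 inputs + bias, 2 hidden nodes + bias, 2 outputs.
x = [0.5, -1.0, 1.0]
d = [1.0, 0.0]
w_hid = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
w_out = [[0.3, -0.1], [-0.4, 0.2], [0.1, 0.5]]

g_hid, g_out = analytic_gradients(x, d, w_hid, w_out)
max_diff = max(
    abs(g[r][c] - numeric_gradient(x, d, w_hid, w_out, w, r, c))
    for g, w in ((g_hid, w_hid), (g_out, w_out))
    for r in range(len(w)) for c in range(len(w[0]))
)

# One learning step, w+ = w - eta * dE/dw, should lower the error.
eta = 0.1
error_before = error(x, d, w_hid, w_out)
new_hid = [[w - eta * g for w, g in zip(wr, gr)] for wr, gr in zip(w_hid, g_hid)]
new_out = [[w - eta * g for w, g in zip(wr, gr)] for wr, gr in zip(w_out, g_out)]
error_after = error(x, d, new_hid, new_out)
```

Agreement between the analytic and numeric gradients indicates that the signs and factors in the update equations are mutually consistent, and the single update step confirms that the learning law moves the weights in the direction of lower error.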


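A.2.3  Case II: Sigmoid-Linear.  For the output layer,

$$w^{+}_{j_0 k_0} = w_{j_0 k_0} - \eta \frac{\partial E}{\partial w_{j_0 k_0}}$$

Now, just analyzing the partial derivative term in the expression above yields the
following.

$$\frac{\partial E}{\partial w_{j_0 k_0}} = \frac{\partial}{\partial w_{j_0 k_0}} \left\{ \frac{1}{2} \sum_{k=1}^{K} (d_k - f_k)^2 \right\} = -(d_{k_0} - f_{k_0}) \frac{\partial f_{k_0}}{\partial w_{j_0 k_0}}$$

and substituting the linear transformation function for $f_{k_0}$ in the derivative yields:

$$\frac{\partial E}{\partial w_{j_0 k_0}} = -(d_{k_0} - f_{k_0}) \frac{\partial}{\partial w_{j_0 k_0}} \left\{ \sum_{j=1}^{J+1} w_{j k_0} f_j \right\}$$

$$\frac{\partial E}{\partial w_{j_0 k_0}} = -(d_{k_0} - f_{k_0})(f_{j_0})$$

Therefore, for the output layer of the Sigmoid-Linear Case:

$$w^{+}_{j_0 k_0} = w_{j_0 k_0} + \eta (d_{k_0} - f_{k_0})(f_{j_0}) \tag{A.6}$$

The hidden layer weights must be updated as shown below.

$$w^{+}_{i_0 j_0} = w_{i_0 j_0} - \eta \frac{\partial E}{\partial w_{i_0 j_0}}$$

The partial derivative term of the above expression will be analyzed as before.

$$\frac{\partial E}{\partial w_{i_0 j_0}} = \frac{\partial}{\partial w_{i_0 j_0}} \left\{ \frac{1}{2} \sum_{k=1}^{K} (d_k - f_k)^2 \right\} = -\sum_{k=1}^{K} (d_k - f_k) \frac{\partial f_k}{\partial w_{i_0 j_0}}$$

and substituting the linear transformation function for $f_k$ in the derivative yields:

$$\frac{\partial E}{\partial w_{i_0 j_0}} = -\sum_{k=1}^{K} (d_k - f_k) \frac{\partial}{\partial w_{i_0 j_0}} \left\{ \sum_{j=1}^{J+1} w_{j k} f_j \right\}$$

$$\frac{\partial E}{\partial w_{i_0 j_0}} = -\sum_{k=1}^{K} (d_k - f_k)(w_{j_0 k}) \frac{\partial f_{j_0}}{\partial w_{i_0 j_0}}$$

Substituting the sigmoidal transformation function for $f_{j_0}$ in the derivative and
evaluating it yields:

$$\frac{\partial E}{\partial w_{i_0 j_0}} = -\sum_{k=1}^{K} (d_k - f_k)(w_{j_0 k})(f_{j_0})(1 - f_{j_0})(x_{i_0})$$

Therefore, for the hidden layer of the Sigmoid-Linear Case:

$$w^{+}_{i_0 j_0} = w_{i_0 j_0} + \eta \sum_{k=1}^{K} (d_k - f_k)(w_{j_0 k})(f_{j_0})(1 - f_{j_0})(x_{i_0}) \tag{A.7}$$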


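A.2.4 Case III: Tanh-Tanh. For the output layer,

$$w^{+}_{j_0 k_0} = w^{-}_{j_0 k_0} - \eta\frac{\partial E}{\partial w_{j_0 k_0}}$$

Just analyzing the partial derivative term in the expression above yields the following.

$$\frac{\partial E}{\partial w_{j_0 k_0}} = \frac{\partial}{\partial w_{j_0 k_0}}\left\{\frac{1}{2}\sum_{k=1}^{K}(d_k - f_k)^2\right\} = -(d_{k_0} - f_{k_0})\frac{\partial f_{k_0}}{\partial w_{j_0 k_0}}$$

and substituting the hyperbolic tangent transformation function for $f_{k_0}$ in the derivative yields:

$$\frac{\partial E}{\partial w_{j_0 k_0}} = -(d_{k_0} - f_{k_0})\frac{\partial}{\partial w_{j_0 k_0}}\left\{\tanh\sum_{j=1}^{J+1} w_{j k_0} f_j\right\}$$

$$\frac{\partial E}{\partial w_{j_0 k_0}} = -(d_{k_0} - f_{k_0})\left(\cosh\sum_{j=1}^{J+1} w_{j k_0} f_j\right)^{-2}\frac{\partial}{\partial w_{j_0 k_0}}\left\{\sum_{j=1}^{J+1} w_{j k_0} f_j\right\}$$

$$\frac{\partial E}{\partial w_{j_0 k_0}} = -(d_{k_0} - f_{k_0})\left(\cosh\sum_{j=1}^{J+1} w_{j k_0} f_j\right)^{-2}(f_{j_0})$$

$$w^{+}_{j_0 k_0} = w^{-}_{j_0 k_0} + \eta(d_{k_0} - f_{k_0})\left(\cosh\sum_{j=1}^{J+1} w_{j k_0} f_j\right)^{-2}(f_{j_0})$$

Therefore, for the output layer of the Tanh-Tanh Case:

$$w^{+}_{j_0 k_0} = w^{-}_{j_0 k_0} + \eta(d_{k_0} - f_{k_0})\left(1 - (f_{k_0})^2\right)(f_{j_0}) \tag{A.8}$$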
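The step from the $\cosh^{-2}$ form to the $(1 - f^2)$ form in the Tanh cases rests on a standard identity, written out here as a short worked equation. With $a = \sum_{j=1}^{J+1} w_{j k_0} f_j$ and $f_{k_0} = \tanh a$:

```latex
\frac{d}{da}\tanh a
  = \frac{\cosh^2 a - \sinh^2 a}{\cosh^2 a}
  = \frac{1}{\cosh^2 a}
  = 1 - \tanh^2 a
  = 1 - (f_{k_0})^2
```

The same substitution applies to a hidden node, where $f_{j_0} = \tanh\sum_i w_{i j_0} x_i$ gives $(\cosh\sum_i w_{i j_0} x_i)^{-2} = 1 - (f_{j_0})^2$.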


The  hidden  layer  weights  must  be  updated  as  shown  below. 


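$$w^{+}_{i_0 j_0} = w^{-}_{i_0 j_0} - \eta\frac{\partial E}{\partial w_{i_0 j_0}}$$

The partial derivative term of the above expression will be analyzed as before.

$$\frac{\partial E}{\partial w_{i_0 j_0}} = \frac{\partial}{\partial w_{i_0 j_0}}\left\{\frac{1}{2}\sum_{k=1}^{K}(d_k - f_k)^2\right\} = -\sum_{k=1}^{K}(d_k - f_k)\frac{\partial f_k}{\partial w_{i_0 j_0}}$$

Substituting the hyperbolic tangent transformation function for $f_k$ in the derivative and evaluating it yields:

$$\frac{\partial E}{\partial w_{i_0 j_0}} = -\sum_{k=1}^{K}(d_k - f_k)\left(\cosh\sum_{j=1}^{J+1} w_{jk} f_j\right)^{-2}\frac{\partial}{\partial w_{i_0 j_0}}\left\{\sum_{j=1}^{J+1} w_{jk} f_j\right\}$$

$$\frac{\partial E}{\partial w_{i_0 j_0}} = -\sum_{k=1}^{K}(d_k - f_k)\left(\cosh\sum_{j=1}^{J+1} w_{jk} f_j\right)^{-2}(w_{j_0 k})\frac{\partial f_{j_0}}{\partial w_{i_0 j_0}}$$

Substituting the hyperbolic tangent transformation function for $f_{j_0}$ in the derivative yields:

$$\frac{\partial E}{\partial w_{i_0 j_0}} = -\sum_{k=1}^{K}(d_k - f_k)\left(\cosh\sum_{j=1}^{J+1} w_{jk} f_j\right)^{-2}(w_{j_0 k})\frac{\partial}{\partial w_{i_0 j_0}}\left\{\tanh\sum_{i=1}^{I+1} w_{i j_0} x_i\right\}$$

$$\frac{\partial E}{\partial w_{i_0 j_0}} = -\sum_{k=1}^{K}(d_k - f_k)\left(\cosh\sum_{j=1}^{J+1} w_{jk} f_j\right)^{-2}(w_{j_0 k})\left(\cosh\sum_{i=1}^{I+1} w_{i j_0} x_i\right)^{-2}\frac{\partial}{\partial w_{i_0 j_0}}\left\{\sum_{i=1}^{I+1} w_{i j_0} x_i\right\}$$

$$\frac{\partial E}{\partial w_{i_0 j_0}} = -\sum_{k=1}^{K}(d_k - f_k)\left(\cosh\sum_{j=1}^{J+1} w_{jk} f_j\right)^{-2}(w_{j_0 k})\left(\cosh\sum_{i=1}^{I+1} w_{i j_0} x_i\right)^{-2}(x_{i_0})$$

$$w^{+}_{i_0 j_0} = w^{-}_{i_0 j_0} + \eta\sum_{k=1}^{K}(d_k - f_k)\left(\cosh\sum_{j=1}^{J+1} w_{jk} f_j\right)^{-2}(w_{j_0 k})\left(\cosh\sum_{i=1}^{I+1} w_{i j_0} x_i\right)^{-2}(x_{i_0})$$

Therefore, for the hidden layer of the Tanh-Tanh Case:

$$w^{+}_{i_0 j_0} = w^{-}_{i_0 j_0} + \eta\sum_{k=1}^{K}(d_k - f_k)\left(1 - (f_k)^2\right)(w_{j_0 k})\left(1 - (f_{j_0})^2\right)(x_{i_0}) \tag{A.9}$$

A.2.5 Case IV: Tanh-Linear. For the output layer, the learning law is the same as the learning law derived earlier in Case II (Equation A.6).

$$w^{+}_{j_0 k_0} = w^{-}_{j_0 k_0} + \eta(d_{k_0} - f_{k_0})(f_{j_0})$$

For the hidden layer weights the derivation is provided below.

$$w^{+}_{i_0 j_0} = w^{-}_{i_0 j_0} - \eta\frac{\partial E}{\partial w_{i_0 j_0}}$$

The partial derivative term of the expression above is analyzed below.

$$\frac{\partial E}{\partial w_{i_0 j_0}} = \frac{\partial}{\partial w_{i_0 j_0}}\left\{\frac{1}{2}\sum_{k=1}^{K}(d_k - f_k)^2\right\} = -\sum_{k=1}^{K}(d_k - f_k)\frac{\partial f_k}{\partial w_{i_0 j_0}}$$

Substituting the linear transformation function for $f_k$ in the derivative, evaluating it, and substituting the hyperbolic tangent transformation function in the resulting derivative for $f_{j_0}$ yields:

$$\frac{\partial E}{\partial w_{i_0 j_0}} = -\sum_{k=1}^{K}(d_k - f_k)(w_{j_0 k})\frac{\partial}{\partial w_{i_0 j_0}}\left\{\tanh\sum_{i=1}^{I+1} w_{i j_0} x_i\right\}$$

$$\frac{\partial E}{\partial w_{i_0 j_0}} = -\sum_{k=1}^{K}(d_k - f_k)(w_{j_0 k})\left(\cosh\sum_{i=1}^{I+1} w_{i j_0} x_i\right)^{-2}(x_{i_0})$$

Therefore, for the hidden layer of the Tanh-Linear Case:

$$w^{+}_{i_0 j_0} = w^{-}_{i_0 j_0} + \eta\sum_{k=1}^{K}(d_k - f_k)(w_{j_0 k})\left(1 - (f_{j_0})^2\right)(x_{i_0}) \tag{A.10}$$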
Appendix B. Matlab Code


B.1 removeh.m

% This function removes the homogeneous features in the data set
% "datain". The reduced data set is "dataout". The features
% that were deleted are stored in "featsdel", while the
% features that were kept are in "featskept".

% syntax: [dataout featsdel featskept]=removeh(datain);

function [dataout,featsdel,featskept] = removeh(datain)

[row col]=size(datain);
target=datain(1,:);
featsdel=zeros(1,col);
featskept=zeros(1,col);
for j=2:col,
    % A feature is homogeneous if no sample differs from the first sample
    temp=find(datain(:,j)~=target(1,j));
    if isempty(temp),
        featsdel(1,j)=1;
    else
        featskept(1,j)=1;
    end;
end;

index=find(featskept);
dataout=[datain(:,1) datain(:,index)];
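For readers without MATLAB, the idea behind removeh.m can be sketched in a few lines of Python. This is an illustrative restatement, not the thesis code: the function name `remove_homogeneous` and the list-of-rows layout (class label in position 0, features after it) are assumptions, and column indices are 0-based here whereas the MATLAB listing is 1-based.

```python
def remove_homogeneous(rows):
    """Drop feature columns whose value is identical in every sample.

    rows: list of samples, each [class_label, feature_1, feature_2, ...].
    Returns (reduced_rows, deleted_columns, kept_columns).
    """
    n_cols = len(rows[0])
    # Keep a column if at least one sample differs from the first sample
    kept = [j for j in range(1, n_cols)
            if any(r[j] != rows[0][j] for r in rows)]
    deleted = [j for j in range(1, n_cols) if j not in kept]
    reduced = [[r[0]] + [r[j] for j in kept] for r in rows]
    return reduced, deleted, kept

# Illustrative data: column 1 is constant and carries no class information
data = [[1, 5, 2, 7],
        [1, 5, 3, 7],
        [2, 5, 4, 8]]
reduced, deleted, kept = remove_homogeneous(data)
```

A constant feature cannot help separate classes, so removing it shrinks the input layer without changing what the classifier can learn.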

B.2 normal.m

% This function performs a simple normalization of the
% data "datain".

% syntax: dataout=normal(datain);

function [datain] = normal(datain)

[row col]=size(datain);
means=mean(datain);
stds=std(datain);
for i=1:col
    % Guard against division by zero for constant columns
    if (stds(1,i)==0),
        stds(1,i)=1e-15;
    end;
end;

for i=1:row,
    datain(i,:)=(datain(i,:)-means)./stds;
end;
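The normalization in normal.m subtracts each column's mean and divides by its standard deviation. A minimal Python sketch of the same arithmetic follows; the name `normalize_columns` is an assumption, and `statistics.stdev` is used because, like MATLAB's default `std`, it computes the sample (N-1) standard deviation.

```python
import statistics

def normalize_columns(rows, floor=1e-15):
    """Z-score each column; a zero standard deviation is replaced by `floor`."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    # `or floor` swaps in the small constant when the deviation is exactly 0
    stds = [statistics.stdev(c) or floor for c in cols]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

# Illustrative data: the second column is constant
normalized = normalize_columns([[1.0, 4.0], [2.0, 4.0], [3.0, 4.0]])
```

Constant columns come out as all zeros, since the numerator is zero before the tiny divisor is applied.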

B.3 randchoose.m

% This function divides the data set "datain" into two
% subsets "train" and "test" by randomly selecting
% members by class for each subset.
% syntax: [train test]=randchoose(datain)

function [train,test]=randchoose(datain)
train=[];
test=[];

[row col]=size(datain);

% Shuffle the samples, then split each class separately
Order=randperm(row);
datain=datain(Order,:);
for i=1:max(datain(:,1)),
    index=find(datain(:,1)==i);
    N=size(index,1);
    if N==1,
        % A single sample is placed in both subsets
        train=[train;datain(index,:)];
        test=[test;datain(index,:)];
    elseif N==2,
        train=[train;datain(index(1),:)];
        test=[test;datain(index(2),:)];
    else
        % Roughly two thirds for training, the remainder for testing
        interval=2*fix(N/3);
        train=[train;datain(index(1:interval),:)];
        test=[test;datain(index(interval+1:N),:)];
    end;
end;
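The per-class split rule in randchoose.m (one sample goes to both subsets, two samples are split one each, otherwise 2*fix(N/3) samples go to training) can be restated as a short Python sketch. The name `split_by_class` and the optional `rng` argument are assumptions made for illustration; unlike the MATLAB loop, this version iterates only over labels that actually occur.

```python
import random

def split_by_class(rows, rng=None):
    """Randomly split rows ([label, features...]) into train and test per class."""
    rng = rng or random.Random()
    rows = list(rows)
    rng.shuffle(rows)
    train, test = [], []
    for label in sorted({r[0] for r in rows}):
        members = [r for r in rows if r[0] == label]
        n = len(members)
        if n == 1:
            train += members          # a lone sample appears in both subsets
            test += members
        elif n == 2:
            train.append(members[0])
            test.append(members[1])
        else:
            cut = 2 * (n // 3)        # about two thirds for training
            train += members[:cut]
            test += members[cut:]
    return train, test

# Illustrative data: 7 samples of class 1, 2 of class 2, 1 of class 3
rows = [[1, i] for i in range(7)] + [[2, 10], [2, 11]] + [[3, 20]]
train, test = split_by_class(rows, random.Random(0))
```

Note that a class with a single sample is tested on the same sample it was trained on, so its contribution to the test error is optimistic.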


B.4 fallside.m

% This function creates fallside plots.
% syntax: [x y] = fallside(datain,fid,J,K,f1,f2,l,m,df,me);
% x - matrix containing the classification error curves
% y - matrix containing the SSE curves
% datain - matrix containing the input data
% fid - file id for the log file (fid=1 writes to standard output)
% J - number of hidden nodes for the neural network to be used
% K - number of output nodes for the neural network to be used
%     (K is usually equal to the number of classes in the data)
% f1 - hidden layer transfer function
% f2 - output layer transfer function
% l - vector containing the initial lr, the lr increment, and the final lr
% m - vector containing the initial mc, the mc increment, and the final mc
% df - number of epochs between writes to the log file
% me - max number of epochs for training

function [cecurves,ssecurves] = fallside(datain,fid,J,K,f1,f2,l,m,df,me);

% Initialize the expected output matrix, t, and the min,max matrix P
t=[];
P=[];

% Create the t matrix based on the data classes in column 1 of the data matrix
t=zeros(datain(size(datain,1),1),size(datain,1));
for i=1:size(datain,1),
    t(datain(i,1),i)=1;
end;

% Remove the class column from the input data matrix
datain=datain(:,2:size(datain,2));

% Transpose the matrix for use with Matlab neural network functions
datain=datain';

% Initialize the neural network architecture
I=size(datain,1); % Number of input nodes

% Initialize the error vectors for training, testing and evaluation classification error
cecurves=[];
ssecurves=[];

% Initialize the weight and bias matrices
W1=feval(f1,rand(J,I));
W2=feval(f2,rand(K,J));
B1=feval(f1,rand(J,1));
B2=feval(f2,rand(K,1));
W1old=W1;
W2old=W2;
B1old=B1;
B2old=B2;

% Let learning rate and momentum vary from .1 to
% .9 in .2 increments
for lr=l(1):l(2):l(3),
    for mc=m(1):m(2):m(3),

        % Set input parameters for backpropagation learning
        TP=[df me 0 lr 1 1 mc 1.04];

        % Train the neural network
        [W1 B1 W2 B2 TE TR CE]=bpm(fid,W1,B1,f1,...
            W2,B2,f2,datain,t,TP);
        TR=TR(1,:);

        % Augment the ce and sse vectors
        cecurves=[cecurves;[lr mc CE]];
        ssecurves=[ssecurves;[lr mc TR]];

        % Reset the biases and weights to the original initial
        % values for the next pass
        W1=W1old;
        W2=W2old;
        B1=B1old;
        B2=B2old;
        fprintf('\n');
    end;
end;

B.5 postf.m

function postf(datain,frame,ptype)
for i=1:5:25,
    ep3d2dxy(0:1000,.1:.2:.9,datain(i:i+4,3:1003),...
        'xlabel','ylabel',...
        'zlabel',frame,ptype);
end;

B.6 bpm.m

function [w1,b1,w2,b2,i,tr,ce] = bpm(fid,w1,b1,f1,w2,b2,f2,p,t,tp)

% This function trains an MLP via backpropagation with momentum.
% The inputs are as follows:
% w1: Matrix of hidden layer weights
% b1: Matrix of hidden layer bias weights
% f1: String variable denoting hidden layer output function
% w2: Matrix of output layer weights
% b2: Matrix of output layer bias weights
% f2: String variable denoting output layer output function
% p: Matrix of input data
% t: Matrix containing the desired output vector for each
%    input vector
% tp: Vector containing the input parameters for training
% The outputs are as follows:
% w1: Hidden layer weights after training
% b1: Hidden layer bias weights after training
% w2: Output layer weights after training
% b2: Output layer bias weights after training
% Note - these parameters may be used to classify data using the "simuff" command
% i: Number of epochs trained
% tr: Matrix of sum squared error during training
% ce: Vector containing classification error during training
% Note - this function is a modified version of the "trainbpx" function which is
% part of the neural network toolbox. It has been modified to include
% classification error calculations

% TRAINING PARAMETERS
df = tp(1); % Number of epochs between displays
me = tp(2); % Maximum number of epochs
eg = tp(3); % Threshold Error
lr = tp(4); % Learning rate or eta
im = tp(5); % Learning rate and Momentum adaptation parameter - not used in this version
dm = tp(6); % Learning rate and Momentum adaptation parameter - not used in this version
mc = tp(7); % Momentum or alpha
er = tp(8); % Error ratio - not used in this version

% Determine the delta functions for each layer
df1 = feval(f1,'delta');
df2 = feval(f2,'delta');

% Initialize the weight changes to zero
dw1 = w1*0; db1 = b1*0;
dw2 = w2*0; db2 = b2*0;

% Initialize the classification error vector
ce=[];

% Calculate the initial network output
a1 = feval(f1,w1*p,b1);
a2 = feval(f2,w2*a1,b2);

% Calculate the initial sum squared error and classification error
e = t-a2;
[temp It]=max(t);
[temp Ia2]=max(a2);
I=find(It==Ia2);
ce=[ce (1-size(I,2)/size(t,2))];
SSE = sumsqr(e);

% Initialize the training record
tr = zeros(2,me+1);
tr(1:2,1) = [SSE; lr];

% Print the output message
message = sprintf('TRAINBPX: %%g/%g epochs, lr = %%g, mc=%%g SSE = %%g\n',me);
fprintf(fid,message,0,lr,mc,SSE);

% BACKPROPAGATION PHASE
% Calculate error derivatives
d2 = feval(df2,a2,e);
d1 = feval(df1,a1,d2,w2);
for i=1:me

    % CHECK PHASE
    if SSE < eg, i=i-1; break, end

    % LEARNING PHASE
    % Calculate the weight changes for each layer according
    % to the backpropagation learning rule
    % (as written, only the weights change; the biases are carried over)
    [dw1] = learnbpm(p,d1,lr,mc,dw1);
    [dw2] = learnbpm(a1,d2,lr,mc,dw2);
    new_w1 = w1 + dw1; new_b1 = b1;
    new_w2 = w2 + dw2; new_b2 = b2;

    % PRESENTATION PHASE
    % Calculate the network output and error
    new_a1 = feval(f1,new_w1*p,new_b1);
    new_a2 = feval(f2,new_w2*new_a1,new_b2);
    new_e = t-new_a2;
    new_SSE = sumsqr(new_e);
    w1 = new_w1; b1 = new_b1; a1 = new_a1;
    w2 = new_w2; b2 = new_b2; a2 = new_a2;
    [temp It]=max(t);
    [temp Ia2]=max(a2);
    I=find(It==Ia2);
    ce=[ce (1-size(I,2)/size(t,2))];
    e = new_e; SSE = new_SSE;

    % BACKPROPAGATION PHASE
    % Calculate the derivative of the error
    d2 = feval(df2,a2,e);
    d1 = feval(df1,a1,d2,w2);

    % TRAINING RECORD
    tr(:,i+1) = [SSE; lr];
    if rem(i,df) == 0
        fprintf(fid,message,i,lr,mc,SSE);
    end
end;

% TRAINING RECORD
tr = tr(1:2,1:(i+1));
if rem(i,df) ~= 0
    fprintf(fid,message,i,lr,mc,SSE);
end
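The classification error that bpm.m appends to `ce` each epoch is one minus the fraction of samples whose largest network output falls in the same row as the 1 in the target matrix. A minimal Python sketch of that calculation, together with the one-hot target construction used throughout this appendix, is shown below. The names `one_hot_targets` and `classification_error` are assumptions; matrices are lists of rows with one column per sample, matching the layout of `t` and `a2`.

```python
def one_hot_targets(labels, n_classes):
    """Build t with t[k][n] = 1 when sample n belongs to class k + 1."""
    t = [[0] * len(labels) for _ in range(n_classes)]
    for n, label in enumerate(labels):
        t[label - 1][n] = 1
    return t

def classification_error(targets, outputs):
    """Fraction of samples whose winning output row differs from the target row."""
    n_samples = len(targets[0])

    def winner(matrix, n):
        column = [row[n] for row in matrix]
        return column.index(max(column))   # first maximum, like MATLAB's max

    hits = sum(winner(targets, n) == winner(outputs, n)
               for n in range(n_samples))
    return 1 - hits / n_samples

# Illustrative outputs for four samples of classes 1, 2, 2, 3;
# the third sample is (wrongly) assigned to class 3
t = one_hot_targets([1, 2, 2, 3], 3)
a2 = [[0.90, 0.10, 0.20, 0.10],
      [0.05, 0.80, 0.30, 0.20],
      [0.05, 0.10, 0.50, 0.70]]
ce = classification_error(t, a2)
```

Because only the position of the maximum matters, this measure can keep improving or stall independently of the sum squared error that drives training.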

B.7 bpmte.m

function [w1,b1,w2,b2,i,tr,ce,tsce] = bpmte(fid,w1,b1,f1,...
    w2,b2,f2,p,p2,t,t2,tp)

% This function trains an MLP via backpropagation with momentum.
% The inputs are as follows:
% w1: Matrix of hidden layer weights
% b1: Matrix of hidden layer bias weights
% f1: String variable denoting hidden layer output function
% w2: Matrix of output layer weights
% b2: Matrix of output layer bias weights
% f2: String variable denoting output layer output function
% p: Matrix of input data
% p2: Matrix of test set input data
% t: Matrix containing the desired output vector for
%    each input vector
% t2: Matrix containing the desired output vector for
%    each test set input vector
% tp: Vector containing the input parameters for training
% The outputs are as follows:
% w1: Hidden layer weights after training
% b1: Hidden layer bias weights after training
% w2: Output layer weights after training
% b2: Output layer bias weights after training
% Note - these parameters may be used to classify data using the "simuff" command
% i: Number of epochs trained
% tr: Matrix of sum squared error during training
% ce: Vector containing classification error during training
% tsce: Vector containing test set classification error during training
% Note - this function is a modified version of the "trainbpx" function which is
% part of the neural network toolbox. It has been modified to include
% classification error calculations

% TRAINING PARAMETERS
df = tp(1); % Number of epochs between displays
me = tp(2); % Maximum number of epochs
eg = tp(3); % Threshold Error
lr = tp(4); % Learning rate or eta
im = tp(5); % Learning rate and Momentum adaptation parameter - not used in this version
dm = tp(6); % Learning rate and Momentum adaptation parameter - not used in this version
mc = tp(7); % Momentum or alpha
er = tp(8); % Error ratio - not used in this version

% Determine the delta functions for each layer
df1 = feval(f1,'delta');
df2 = feval(f2,'delta');

% Initialize the weight changes to zero
dw1 = w1*0; db1 = b1*0;
dw2 = w2*0; db2 = b2*0;

% Initialize the classification error vectors for training, test, and eval
ce=[];
tsce=[];

% Calculate the initial network output
a1 = feval(f1,w1*p,b1);
a2 = feval(f2,w2*a1,b2);

% Calculate the initial sum squared error and classification error
e = t-a2;
[temp It]=max(t);
[temp Ia2]=max(a2);
I=find(It==Ia2);
ce=[ce (1-size(I,2)/size(t,2))];
[tsa1 tsa2]=simuff(p2,w1,b1,f1,w2,b2,f2);
[temp It]=max(t2);
[temp Ia]=max(tsa2);
I=find(It==Ia);
tsce=[tsce (1-size(I,2)/size(t2,2))];
SSE = sumsqr(e);

% Initialize the training record
tr = zeros(2,me+1);
tr(1:2,1) = [SSE; lr];

% Print the output message
message = sprintf('TRAINBPX: %%g/%g epochs, lr = %%g, mc=%%g SSE = %%g trce=%%3.1f tsce=%%3.1f\n',me);
fprintf(fid,message,0,lr,mc,SSE,ce(1)*100,tsce(1)*100);

% BACKPROPAGATION PHASE
% Calculate error derivatives
d2 = feval(df2,a2,e);
d1 = feval(df1,a1,d2,w2);
for i=1:me
    % CHECK PHASE
    if SSE < eg, break, end;

    % LEARNING PHASE
    % Calculate the weight changes for each layer according
    % to the backpropagation learning rule
    [dw1] = learnbpm(p,d1,lr,mc,dw1);
    [dw2] = learnbpm(a1,d2,lr,mc,dw2);
    new_w1 = w1 + dw1; new_b1 = b1;
    new_w2 = w2 + dw2; new_b2 = b2;

    % PRESENTATION PHASE
    % Calculate the network output and error
    new_a1 = feval(f1,new_w1*p,new_b1);
    new_a2 = feval(f2,new_w2*new_a1,new_b2);
    new_e = t-new_a2;
    new_SSE = sumsqr(new_e);
    w1 = new_w1; b1 = new_b1; a1 = new_a1;
    w2 = new_w2; b2 = new_b2; a2 = new_a2;
    [temp It]=max(t);
    [temp Ia2]=max(a2);
    I=find(It==Ia2);
    ce=[ce (1-size(I,2)/size(t,2))];
    [tsa1 tsa2]=simuff(p2,w1,b1,f1,w2,b2,f2);
    [temp It]=max(t2);
    [temp Ia]=max(tsa2);
    I=find(It==Ia);
    tsce=[tsce (1-size(I,2)/size(t2,2))];
    e = new_e; SSE = new_SSE;

    % BACKPROPAGATION PHASE
    % Calculate the derivative of the error
    d2 = feval(df2,a2,e);
    d1 = feval(df1,a1,d2,w2);

    % TRAINING RECORD
    tr(:,i+1) = [SSE; lr];
    if rem(i,df) == 0
        fprintf(fid,message,i,lr,mc,SSE,ce(i)*100,tsce(i)*100);
    end
end;

% TRAINING RECORD
tr = tr(1:2,1:(i+1));
if rem(i,df) ~= 0
    fprintf(fid,message,i,lr,mc,SSE,ce(i)*100,tsce(i)*100);
end

B.8 tev.m

function [TRCEtot,TSCEtot]=tev(train,test,fid,J,K,...
    f1,f2,lr,mc,df,me)

% This function calculates the classification error
% for the test and evaluation sets during training.
% This is accomplished numerous times to compute the
% 95% confidence bounds for the mean error. The
% number of iterations is specified by "passes"
%
% syntax: [x y]=tev(infile,fid,N,J,K,f1,f2,lr,mc,df,me,passes);
% x - matrix containing mean training set classification error
%     curve with 95% confidence bounds
%     row 1 - lower bound
%     row 2 - mean
%     row 3 - upper bound
% y - matrix containing mean test set classification error curve w/95%
%     confidence bounds
% z - matrix containing mean eval set classification error curve w/95%
%     confidence bounds
% infile - name of the file to be used for random data selection
% fid - file id for the log file (fid=1 writes to standard output)
% J - number of hidden nodes for the neural network to be used
% K - number of output nodes for the neural network to be used
%     (K is usually equal to the number of classes in the data)
% f1 - hidden layer transfer function
% f2 - output layer transfer function
% lr - learning rate
% mc - momentum
% df - number of epochs between writes to the log file
% me - max number of epochs for training

% Initialize the test and eval error matrices
TSCEtot=[];
TRCEtot=[];

% Initialize the desired output matrices for each data set
t=zeros(train(size(train,1),1),size(train,1));
t2=zeros(test(size(test,1),1),size(test,1));

fprintf(fid,'Creating t matrix...\n');

% Determine the desired output matrices for each data set
for i=1:size(train,1),
    t(train(i,1),i)=1;
end;

for i=1:size(test,1),
    t2(test(i,1),i)=1;
end;

% Remove the class information from each data set and
% transpose each data set
train=train(:,2:size(train,2));
train=train';

test=test(:,2:size(test,2));
test=test';

% Initialize neural network parameters
I=size(train,1); % Number of input nodes
W1=feval(f1,rand(J,I));
W2=feval(f2,rand(K,J));
B1=feval(f1,rand(J,1));
B2=feval(f2,rand(K,1));
TP=[df me .02 lr 1 1 mc 1.04];

% Train the neural network
[W1 B1 W2 B2 TE TR CE TSCE]=bpmte(fid,W1,B1,f1,W2,B2,f2,train,test,t,t2,TP);
save weights W1 B1 W2 B2;

% Augment the classification error for the current run with those of previous runs
TSCEtot=[TSCEtot;TSCE];
TRCEtot=[TRCEtot;CE];

B.9 fselct.m

function [feats,Eperfeat]=fselct(datain,fid,numfeats,J,K,f1,f2,...
    lr,mc,df,me)

% This function performs forward sequential feature selection
%
% syntax:
% [feats,Eperfeat]=fselct(infile,fid,numfeats,J,K,f1,f2,lr,mc,df,me)
% feats - vector containing the prioritized features
% Eperfeat - vector containing the error as each feature is
%     added to the nucleus
% numfeats - the number of features to select out of the total
% infile - mat filename specifies data to use
% fid - file id for the log file (fid=1 writes to standard output)
% J - number of hidden nodes
% K - number of output nodes for the neural network to be used
%     (K is usually equal to the number of classes in the data)
% f1 - hidden layer transfer function
% f2 - output layer transfer function
% lr - learning rate
% mc - momentum
% df - number of epochs between writes to the log file
% me - max number of epochs for training

% Initialize vectors
nucleus=[];
Eperfeat=[];
errors=[];
t=[];

fprintf(fid,'Building t Matrix...\n');
t=zeros(datain(size(datain,1),1),size(datain,1));
for i=1:size(datain,1),
    t(datain(i,1),i)=1;
end;

datain=datain(:,2:size(datain,2));
datain=datain';

[row col]=size(datain);
feats=[];
nucleus=[];
availfeats=1:row;

W2=feval(f2,rand(K,J));
B1=feval(f1,rand(J,1));
B2=feval(f2,rand(K,1));
W1old=[];
W2old=W2;
B1old=B1;
B2old=B2;

fprintf(fid,'Performing feature selection...\n');
while size(feats,2)<numfeats,
    fprintf(fid,'\n');
    errors=[];
    % Grow the hidden layer weight matrix by one input column
    W1=[W1old feval(f1,rand(J,1))];
    W1old=W1;
    for z=1:size(availfeats,2),
        % Try the current nucleus plus one candidate feature
        nucleus=[feats availfeats(1,z)];
        newdata=datain(nucleus,:);
        I=size(newdata,1); % Number of input nodes
        W1=W1old;
        W2=W2old;
        B1=B1old;
        B2=B2old;
        TP=[df me .02 lr 1 1 mc 1.04];
        [W1 B1 W2 B2 TE TR CE]=bpm(0,W1,B1,f1,W2,B2,f2,newdata,t,TP);
        errors=[errors CE(1,size(CE,2))];
        fprintf(fid,'Nucleus: ');
        fprintf(fid,' %g',nucleus);
        fprintf(fid,' ce= %g \n',CE(1,size(CE,2)));
    end;

    % Keep the candidate with the lowest final classification error
    [temp minI]=min(errors);
    Eperfeat=[Eperfeat errors(1,minI)];
    feats=[feats availfeats(1,minI)];
    if minI==size(availfeats,2),
        availfeats=availfeats(1,1:minI-1);
    else,
        availfeats=[availfeats(1,1:minI-1) availfeats(minI+1:size(availfeats,2))];
    end;
end;
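The greedy loop in fselct.m can be summarized independently of the network training. In the Python sketch below, `error_fn` is a hypothetical stand-in for "train the network on this feature subset and return its final classification error" (the call to bpm in the listing); the toy error function and its weights exist only to make the example runnable.

```python
def forward_select(n_features, n_select, error_fn):
    """Greedy forward sequential selection.

    error_fn(feature_list) -> error; lower is better (assumed interface).
    Returns the selected features in priority order and the error after each addition.
    """
    selected, errors = [], []
    available = list(range(n_features))
    while len(selected) < n_select and available:
        # Score the current nucleus extended by each remaining candidate
        trials = [(error_fn(selected + [f]), f) for f in available]
        best_error, best = min(trials)
        selected.append(best)
        errors.append(best_error)
        available.remove(best)
    return selected, errors

# Toy stand-in: each feature has a fixed "usefulness"; more usefulness, less error
usefulness = [0.1, 2.0, 0.5, 1.0]

def toy_error(features):
    return 1.0 / (1.0 + sum(usefulness[f] for f in features))

selected, errors = forward_select(4, 3, toy_error)
```

Each pass costs one training run per remaining feature, so selecting m of n features needs on the order of m times n runs; this is the main expense of the method.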


B.10 bbayes.m

function [Upper]=Bbayes(datain,fid,Nodes,K,f1,f2,lr,mc,df,me)

% This script estimates bounds on the Bayes error rate
% Upper - vector containing upper Bayes bound
% datain - matrix containing the input data
% fid - file id for the log file (fid=1 writes to standard output)
% Nodes - max number of hidden nodes
% K - number of output nodes for the neural network to be used
%     (K is usually equal to the number of classes in the data)
% f1 - hidden layer transfer function
% f2 - output layer transfer function
% lr - learning rate
% mc - momentum
% df - number of epochs between writes to the log file
% me - max number of epochs for training

% Initialize vectors
Upper=[];
t=[];

fprintf(fid,'Building t Matrix...\n');

% Create the t matrix based on the data classes in column 1 of the data matrix
t=zeros(datain(size(datain,1),1),size(datain,1));
for i=1:size(datain,1),
    t(datain(i,1),i)=1;
end;

fprintf(fid,'Normalizing Data...\n');
datain=datain(:,2:size(datain,2));
datain=datain';

I=size(datain,1); % Number of input nodes

fprintf(fid,'Calculating Upper Bound\n');
B2=feval(f2,rand(K,1));
W1old=[];
W2old=[];
B1old=[];
B2old=B2;
for J=1:Nodes,
    fprintf(fid,'Number of Nodes = %g \n',J);
    misses=0;
    % Add one hidden node to the previous architecture
    W1=[W1old; feval(f1,rand(1,I))];
    W2=[W2old feval(f2,rand(K,1))];
    B1=[B1old; feval(f1,rand(1,1))];
    B2=B2old;
    W1old=W1;
    W2old=W2;
    B1old=B1;
    for L=1:size(datain,2);
        % Leave sample L out, train on the rest, then classify sample L
        fprintf(fid,'Sample Left Out = %g \n',L);
        newsteel=[datain(:,1:(L-1)) datain(:,(L+1):size(datain,2))];
        newt=[t(:,1:(L-1)) t(:,(L+1):size(t,2))];
        sample=datain(:,L);
        samplet=t(:,L);
        W1=W1old;
        W2=W2old;
        B1=B1old;
        B2=B2old;
        TP=[df me .02 lr 1 1 mc 1.04];
        [W1 B1 W2 B2 TE TR CE]=bpm(fid,W1,B1,f1,W2,B2,f2,...
            newsteel,newt,TP);
        [a1 a2]=simuff(sample,W1,B1,f1,W2,B2,f2);
        [temp It]=max(samplet);
        [temp Ia2]=max(a2);
        if(It~=Ia2),
            misses=misses+1;
        end;
        clear W1 W2 B1 B2;
    end;
    Upper=[Upper misses/size(datain,2)];
end;
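The inner loop of bbayes.m is a leave-one-out error estimate: each sample is held out in turn, the classifier is trained on the rest, and the held-out sample is classified. The Python sketch below shows only that procedure. The `train_fn` and `predict_fn` parameters are an assumed interface, and the nearest-mean classifier is a deliberately simple stand-in for the MLP so the example stays self-contained.

```python
def leave_one_out_error(samples, labels, train_fn, predict_fn):
    """Fraction of samples misclassified when each is held out once."""
    misses = 0
    for held in range(len(samples)):
        rest_x = samples[:held] + samples[held + 1:]
        rest_y = labels[:held] + labels[held + 1:]
        model = train_fn(rest_x, rest_y)
        if predict_fn(model, samples[held]) != labels[held]:
            misses += 1
    return misses / len(samples)

# Stand-in classifier (not the thesis MLP): nearest class mean on scalar data
def train_nearest_mean(xs, ys):
    sums = {}
    for x, y in zip(xs, ys):
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}

def predict_nearest_mean(model, x):
    return min(model, key=lambda y: abs(model[y] - x))

# Illustrative data: the last sample is labeled 2 but sits among class 1
xs = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2, 0.15]
ys = [1, 1, 1, 2, 2, 2, 2]
loo_error = leave_one_out_error(xs, ys, train_nearest_mean, predict_nearest_mean)
```

With N samples the procedure needs N training runs per architecture, which is why the listing repeats it only once for each hidden layer size.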

B.11 ep3d2dxy.m

function ep3d2dxy(x,y,z,xstr,ystr,zstr,makesqTF,logyTF);

% EP3D2DXY Handy utility for printing 3D data onto 2D plot
% ep3d2dxy(x,y,z,xstr,ystr,zstr,makesqTF,logyTF);
%
% by: Capt Toby Reeves, Capt. Edward M. Ochoa, GEO-96D

x=x(:);
y=y(:);
figure
if ~logyTF
    h=plot(x,z(1,:)');
    hold on
    for i=2:size(z,1)
        h=[h;plot(x,z(i,:)')];
    end
else
    h=semilogy(x,z(1,:)');
    hold on
    for i=2:size(z,1)
        h=[h;semilogy(x,z(i,:)')];
    end
end
hold off

% (the line-style arguments of the next statement are illegible in the source)
style=str2mat(
color=str2mat('y','m','c','r','g','b');
slen=size(style,1);
clen=size(color,1);
linewidth=.5;
for i=1:length(h)
    set(h(i),'LineStyle',style(rem(i-1,slen)+1,:))
    if (rem(i,4) | ~rem(i,4)),
        linewidth=linewidth+.25;
        linecolr=color(rem(i-1,clen)+1,:);
    end
    set(h(i),'LineWidth',linewidth)
    set(h(i),'Color',linecolr)
end

ylabel(zstr)
xlabel(xstr)
%title(sprintf('(%s, %s) vs. %s',xstr,ystr,zstr))
title('title');

if makesqTF
    axis square
end

lgnd=[];
lgnd = num2str(y(1));
for i=2:size(y,1)
    lgnd = str2mat(lgnd,num2str(y(i)));
end
h=legend(lgnd,-1);
axes(h);
title(sprintf('%s',ystr))

B.12 errorbars.m

function [devs,means]=errorbars(datain,features,classes)
% syntax [devs means]=errorbars(datain,features,classes)
devs=[];
means=[];

% Compute the per-class mean and standard deviation of every column
for i=1:max(datain(:,1)),
    index=find(datain(:,1)==i);
    if size(index,1)~=1,
        temp=std(datain(index,:));
        devs=[devs;temp];
        means=[means;mean(datain(index,:))];
    else
        % A single-sample class has zero deviation
        devs=[devs;zeros(1,size(datain,2))];
        means=[means;datain(index,:)];
    end;
end;

% One figure per requested feature, one error bar per requested class
for i=1:size(features,2),
    figure;
    for j=1:size(classes,2),
        m=means(classes(1,j),features(1,i)+1);
        s=devs(classes(1,j),features(1,i)+1);
        y=m-s:2*s/10:m+s;
        x=ones(1,size(y,2))*classes(1,j);
        plot(classes(1,j),m-s,'w+');
        hold on;
        plot(classes(1,j),m+s,'w+');
        plot(classes(1,j),m,'r*');
        plot(x,y)
    end;
    xlabel('Class');
    title(['Mean and Standard Deviation of Feature ' int2str(features(1,i)) ...
        ' vs. Class']);
end;

B.13 removec.m

function [dataout,sizes]=removec(datain,N)
% syntax: [dataout class_sizes]=removec(datain, Target_N_per_class);

dataout=[];
cnt=0;
classes=datain(:,1);
features=datain(:,2:size(datain,2));
sizes=[];

for i=1:max(classes),
    index=find(classes==i);
    sizes=[sizes size(index,1)];
    % Keep only classes with more than N samples and renumber them
    if size(index,1) > N,
        cnt=cnt+1;
        dataout=[dataout;[ones(size(index,1),1)*cnt features(index,:)]];
    end;
end;
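As with the other utilities, the behavior of removec.m can be restated in a short Python sketch: classes with N or fewer samples are dropped and the surviving classes are renumbered consecutively from 1. The function name `remove_small_classes` and the list-of-rows layout are assumptions made for illustration.

```python
def remove_small_classes(rows, min_count):
    """Keep classes with more than min_count samples; relabel them 1, 2, ...

    rows: list of samples, each [class_label, feature_1, ...] with labels 1..max.
    Returns (kept_rows, sizes) where sizes[i] is the original size of class i + 1.
    """
    labels = [r[0] for r in rows]
    sizes, kept, next_label = [], [], 0
    for label in range(1, max(labels) + 1):
        members = [r for r in rows if r[0] == label]
        sizes.append(len(members))
        if len(members) > min_count:
            next_label += 1
            kept += [[next_label] + r[1:] for r in members]
    return kept, sizes

# Illustrative data: class 2 has a single sample and is removed
rows = [[1, 0.1], [1, 0.2], [1, 0.3], [2, 5.0], [3, 9.0], [3, 9.1]]
kept, sizes = remove_small_classes(rows, 1)
```

Because the labels are renumbered, the mapping back to the original classes has to be recovered from `sizes` if it is needed later.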
Appendix C. Elemental Symbols


Element         Symbol    Element         Symbol
Lithium         Li        Antimony        Sb
Beryllium       Be        Tellurium       Te
Boron           B         Iodine          I
Carbon          C         Polonium        Po
Nitrogen        N         Cesium          Cs
Oxygen          O         Barium          Ba
Fluorine        F         Lanthanum       La
Dysprosium      Dy        Cerium          Ce
Sodium          Na        Praseodymium    Pr
Magnesium       Mg        Neodymium       Nd
Aluminum        Al        Promethium      Pm
Silicon         Si        Samarium        Sm
Phosphorus      P         Europium        Eu
Sulfur          S         Gadolinium      Gd
Chlorine        Cl        Terbium         Tb
Potassium       K         Holmium         Ho
Calcium         Ca        Erbium          Er
Scandium        Sc        Thulium         Tm
Titanium        Ti        Ytterbium       Yb
Vanadium        V         Lutetium        Lu
Chromium        Cr        Hafnium         Hf
Manganese       Mn        Tantalum        Ta
Iron            Fe        Tungsten        W
Cobalt          Co        Rhenium         Re
Nickel          Ni        Osmium          Os
Copper          Cu        Iridium         Ir
Zinc            Zn        Platinum        Pt
Gallium         Ga        Gold            Au
Germanium       Ge        Mercury         Hg
Arsenic         As        Thallium        Tl
Selenium        Se        Lead            Pb
Bromine         Br        Bismuth         Bi
Rubidium        Rb        Astatine        At
Strontium       Sr        Californium     Cf
Yttrium         Y         Francium        Fr
Zirconium       Zr        Radium          Ra
Niobium         Nb        Actinium        Ac
Molybdenum      Mo        Thorium         Th
Technetium      Tc        Protactinium    Pa
Ruthenium       Ru        Uranium         U
Rhodium         Rh        Neptunium       Np
Palladium       Pd        Plutonium       Pu
Silver          Ag        Americium       Am
Cadmium         Cd        Curium          Cm
Indium          In        Berkelium       Bk
Tin             Sn
Appendix D. Fallside Plots
D.1 Steel


Figure D.1 Steel Fallside plots: classification [remainder of caption lost in source] over a range of α values.

[Plot: (Epoch, α) versus Classification Error (η=.5); horizontal axis: Epoch]

Figure D.2 Steel Fallside plots: classification error versus epoch for (a) η=.5 and (b) η=.7 over a range of α values.

[Plot: (Epoch, α) versus Classification Error (η=.9)]

Figure D.3 Steel Fallside plots: classification error versus epoch for η=.9 over a range of α values.

Figure D.6 Actinide1 Fallside plots: classification error versus epoch for η=.9 over a range of α values.

[Plot: (Epoch, α) versus Classification Error (η=.5)]

Figure D.8 Actinide2 Fallside plots: classification error versus epoch for (a) η=.5 and (b) η=.7 over a range of α values.

Figure D.9 Actinide2 Fallside plots: classification error versus epoch for η=.9 over a range of α values.
Appendix E. Actinide Classes


Class 

Location 

Class 

Location 

1 

A-1 

29 

M-11 

2 

A-2 

30 

M-12 

3 

A-3 

31 

M-13 

4 

A-4 

32 

M-14 

5 

A-5 

33 

M-15 

6 

A-6 

34 

M-16 

7 

A-7 

35 

M-17 

8 

A-8 

36 

M-18 

9 

B 

37 
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